A potential method for determining the join method without explain plan

I will not pretend that the idea presented in this post is aimed at addressing a number of troubleshooting scenarios but in some cases it may come in handy. For example, imagine a situation such that you cannot log in to the database because of a long-running query which makes it unresponsive. However, the OS is running fine and can be worked with.

The simple and not-yet-so-mature approach presented here should help one judge about the potential join method that a query is performing. At least, it may give certain (though incomplete) ground to reason about it.

Given all other conditions are same, the Nested Loop join method is the one that will execute the highest number of iterations to fetch the requested tuples. The other two methods, Hash Join and Sort-Merge Join will face less number of iterations and, perhaps, less number of loops.

The scenario with the assumptions above is not a rare. However, the question can be “Once you start suspecting that it is most probably the Nested Loop join that’s causing the hang, how can it help?”. At least, one can tell the software developers to fix the query by changing the join method which should solve the issue. Since (as assumed) the database is unresponsive, one cannot peek at the explain plan and tell for sure. What can help though is “expressing the join method via loops”.

To do this, let’s step aside a little.

When the compiler converts a loop in high-level code (while/for) to assembly, it maps it to a Branch instruction. The branch instruction updates the contents of the Program Counter (PC; also known as Extended Instruction Pointer in x86) register by the address stored in the Branch Target Address (BTA). On the next cycle the CPU fetches the instruction from the PC (which is the address of the instruction pointed to by the branch instruction) and executes it. To illustrate it, here is an assembly example for ARM processors:

.global main

main:
        mov     r0, #0     /* setting up initial variable a */
loop:   cmp     r0, #4     /* checking if a==4 */
        beq     end        /* proceeding to the end if a==4 */
        add     r0, r0, #1 /* increasing a by 1 if the jump to the end did not occur */
        b loop             /* repeating the loop */
end:
        bx lr              /* THE END */

Here, b is a branch instruction with the label loop. The label loop points to the instruction cmp (compare). When the CPU reaches the b loop (line 8) instruction, the PC’s contents are updated by the value stored in the loop label which is nothing but the address of the cmp instruction. On the next cycle when the PC is fetched, the address of the cmp (line 5) is returned and the instruction is fetched by its address and executed. In this way, the branch instruction “makes a jump”. This is how a loop iterates over.

Thus, a high number of branch instructions can be an indicator of an excessive execution of iterations, in particular, a Nested Loop join.

The next step is finding some OS statistics that shows the number of branch instructions executing in real time. This is where the perf tool comes into play.

The perf tool is a performance profiler that can display various hardware and software statistics and counters. It can also be attached to a running process. For example, as follows:

# perf stat -p 4539
       1920.159821 task-clock                #    0.991 CPUs utilized          
                13 context-switches          #    0.007 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
               258 page-faults               #    0.134 K/sec                  
     5,649,595,479 cycles                    #    2.942 GHz                     [83.43%]
     1,808,339,931 stalled-cycles-frontend   #   32.01% frontend cycles idle    [83.54%]
     1,171,884,577 stalled-cycles-backend    #   20.74% backend  cycles idle    [66.77%]
     8,625,207,199 instructions              #    1.53  insns per cycle        
                                             #    0.21  stalled cycles per insn [83.51%]
     1,488,797,176 branches                  #  775.351 M/sec                   [82.58%]
        53,395,139 branch-misses             #    3.59% of all branches         [83.78%]

       1.936842598 seconds time elapsed

The command populates statistics until cancelled (CTRL+C). So, the perf command shows that by that time 1,488,797,176 branch instructions were executed out of which 53,395,139 the branch predictor failed to predict. The former is actually is the main metric that is of interest. If a high number of deltas is observed with the branches metric, then this indicates that the code is executing high number of iterations. This in turn, can direct us to suspect that (probably) a Nested Loop join method is running intensively. Peeking at this metric sampled over time and seeing that this number is unreasonably high can give a tip that the code needs to be checked for a loop that is iterating over and over again.

Side note: The Nested Loop join is not the only code path that can cause this. This can also be, for example, due to latch contention since a retry logic is an essential part of pretty much any latch implementation. Another example can be index scan: when the code iteratively walks through the index to find the key which (by the way) can be a part of the Nested Loop join. The approach presented should be applicable to any RDBMS.

In the next post, I will reproduce a highly iterative Nested Loop join and will analyze the metric mentioned. This will give a better insight into improving the idea briefly presented.