Application-oblivious diagnosis

When diagnosing a performance problem, interpreting the flow inside the application is as important as the other aspects of the process, such as tools and methodology. One of the key points is to figure out what the application threads are supposed to do and how they communicate. Once this is established, the rest is a matter of interpreting the figures in light of that design.

The following situation was not very critical from a business perspective, but it was still quite an interesting one. Troubleshooting scenarios like this require taking the right direction from the very beginning and asking the right questions along the way. Finding the answers to those questions completes the job and typically requires a different skill set.

The issue was related to the “hbase shell” command which simply enters the shell so that one can start submitting commands to the database. It took the shell several seconds to return.

-bash-4.2$ hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.3.7.1.7.1000-141
Took 0.0012 seconds                                                                                                                                    
hbase:001:0>

HBase claims it took 0.0012 seconds to complete, which is not true (I am not yet sure what this figure represents, but apparently it measures the time it takes the main thread to hand the job off to one of its clones; details below). It was interesting to see how long the command actually takes to execute:

-bash-4.2$ time hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.3.7.1.7.1000-141
Took 0.0013 seconds
hbase:001:0> exit

real 0m13.464s
user 0m19.437s
sys 0m1.160s

The real figure represents the (almost) actual wall-clock execution time, and user represents the time spent in user space. The output tells us that the problem is purely application related, since the majority of the time is spent in user space. This is why I have to focus on the application behavior rather than analyze the execution in the kernel layer. The first question that comes to mind is: "How can the real execution time be less than the time spent in user space?" Apparently, more than one thread is involved in executing the command, and user time is summed across all the threads, which is why it exceeds the wall-clock time.

To analyze the application behavior one can use the strace tool. It is widely known, although not the best tool for measuring performance. Still, it provides insight into application behavior since it traces the syscalls the application issues to interact with the OS. strace also has flags to trace the syscalls of child threads, if any. I will use the -f (follow threads), -r (relative timestamp) and -T (syscall duration) flags and rerun the "hbase shell" command, this time tracing it with strace and dumping the result to a file for further analysis.
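
The trace was collected with an invocation along these lines (a sketch reconstructed from the flags above; -o writes the trace to the file that the awk and grep commands below read):

bash-4.2$ strace -f -r -T -o stracenew.out hbase shell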

Here is the awk script which sorts the syscalls by the duration:

bash-4.2$ awk '{ gsub("<", "", $NF) ; print $NF,$0 }' stracenew.out | sort -nr | head 
9.437582> 26999 0.000083 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) 9.437582>
4.393135> 27570 0.000045 < futex resumed>) = 0 4.393135>
3.998768> 27570 0.000028 < futex resumed>) = 0 3.998768>
3.065013> 27570 0.000055 < futex resumed>) = 0 3.065013>
2.555355> 27360 0.000321 < futex resumed>) = 0 2.555355>
2.554154> 27359 0.000005 < futex resumed>) = 0 2.554154>
2.551643> 27358 0.000315 < futex resumed>) = 0 2.551643>
2.550368> 27357 0.000005 < futex resumed>) = 0 2.550368>
2.546680> 27356 0.000009 < futex resumed>) = 0 2.546680>
2.546620> 27355 0.000042 < futex resumed>) = 0 2.546620>

The second column is the process/thread id. This output indicates two things: 1) it is indeed a multithreaded process, and 2) the slowness is in the futex-related syscalls. A "futex resumed" line marks the completion of a futex syscall that strace had earlier reported as unfinished.

Futexes are synchronization mechanisms that serve the purpose of inter-process (and inter-thread) communication: a process sleeps on a futex until a certain task is completed, at which point it is woken up and continues execution. That is why I now focus on the futex-related syscalls, starting with the ones issued by the main process (which is 26999 in the output above, easy to confirm from the dump):

bash-4.2$ grep 26999 stracenew | grep futex
26999 0.000049 futex(0x7f78bff870d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000008>
26999 0.000035 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000083 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <9.437582>
26999 0.000025 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000009 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <0.105477>
26999 0.000009 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
27387 0.000014 futex(0x7f78b80eb554, FUTEX_WAIT_BITSET_PRIVATE, 3605, {tv_sec=1403988, tv_nsec=482699946}, 0xffffffff
26999 0.000007 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <1.962090>
26999 0.000014 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000024 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <0.908004>
26999 0.000022 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000010 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <0.038960>
26999 0.000009 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL

The thread 27387 is not what I am looking for. In all the other cases the futex syscalls complete quickly, but this one, "futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL", stays unfinished for a long time. When the second parameter of the futex syscall is FUTEX_WAIT, the kernel checks whether the value at the address 0x7f78c05ce9d0 still equals 27352 and, if so, blocks the caller until a FUTEX_WAKE operation is performed on that address by another futex syscall (indefinitely, since the fourth parameter, the timeout, is NULL). This points to one interesting conclusion: apparently the FUTEX_WAKE comes late, otherwise the main thread, the issuer of the FUTEX_WAIT, would have completed its job and returned.

The game-changing question now becomes: "Which of the threads is responsible for the FUTEX_WAKE?" Once this is found, what remains is to analyze what that particular thread is busy with that causes it to issue the FUTEX_WAKE so late.

So, the futex word is 27352 and, whoever the FUTEX_WAKE issuer is, it is the thread that will update the futex word at the address 0x7f78c05ce9d0 so that the FUTEX_WAIT issuer can detect the change and resume... and 27352 very much resembles a thread id. If that is indeed so, then there has to be a clone/fork syscall issued by the main process 26999 that returned it:

bash-4.2$ grep 27352 stracenew | grep 26999 | grep clone 
26999      0.000039 clone(child_stack=0x7f78c05cdfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f78c05ce9d0, tls=0x7f78c05ce700, child_tidptr=0x7f78c05ce9d0) = 27352 <0.000058>

Got it. The clone syscall returns exactly this thread id.

If the main process waits on a futex word that holds 27352, and 27352 is the id of one of its child threads, then the main thread must have assigned the job to exactly this child, and it is this child that is responsible for the FUTEX_WAKE. Thus, it is thread 27352 that comes back late, which can happen for one of two reasons: either it issues an excessive number of syscalls, or a few of its syscalls take up most of the time.

Let's check the first case. The script below groups the trace by syscall name and prints the number of occurrences of each:

awk '{ syscall_name = substr($3, 1, index($3, "(") - 1); syscall_count[syscall_name]++ } END { for (syscall_name in syscall_count) order[syscall_count[syscall_name], syscall_name] = syscall_name; PROCINFO["sorted_in"] = "@ind_num_desc"; i = 0; for (j in order) { print order[j], syscall_count[order[j]]; i++; if (i == 10) break; } }' stracenew
stat 330179
futex 49207
read 25507
lseek 23337
lstat 17729
rt_sigprocmask 6369
close 5688
mprotect 5046
mmap 2425

The thread 27352 issues lots of stat syscalls for some reason. The total number of syscalls in the dump is 425896, so stat alone accounts for 330179/425896 ~ 78% of all the calls, and these come from thread 27352. Here are some sample calls:

27352      0.000057 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/resources.jar", {st_mode=S_IFREG|0644, st_size=3505346, ...}) = 0 <0.000011>
27352      0.000052 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/rt.jar", {st_mode=S_IFREG|0644, st_size=66339559, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/sunrsasign.jar", 0x7f78c05cd920) = -1 ENOENT (No such file or directory) <0.000010>
27352      0.000044 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/jsse.jar", {st_mode=S_IFREG|0644, st_size=660664, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/jce.jar", {st_mode=S_IFREG|0644, st_size=115651, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/charsets.jar", {st_mode=S_IFREG|0644, st_size=3135615, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/jfr.jar", {st_mode=S_IFREG|0644, st_size=560713, ...}) = 0 <0.000011>

The stat syscall returns the attributes of a file. It does not necessarily involve disk I/O, since the metadata is usually served from the cache, but issuing hundreds of thousands of such calls is still costly.
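
As a sanity check, the stat calls can also be attributed to threads directly from the dump with a one-liner like this (a sketch, assuming the same field layout as in the outputs above: thread id, relative timestamp, then the syscall):

bash-4.2$ awk '$3 ~ /^stat\(/ { n[$1]++ } END { for (t in n) print n[t], t }' stracenew | sort -nr | head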

The next question is: "Why does the app issue so many syscalls to complete a single command?"

The fix is another story. The issue is diagnosed!

Reasoning about the OS Scheduler for Performance Degradation Analysis

It all started with a performance issue in one of our clusters. One of the components on the critical path turned out to be the OS scheduler, which I analyzed to find out whether it was the offender itself or just a symptom, with the global slowness caused by other factors in the stack. The whole story is very long and would take several pages to describe in full, so I will present only a short excerpt from the diagnosis journey. Don't be surprised, then, that the story has (sort of) neither a beginning nor an end.

I decided to peek at the OS scheduler because vmstat reported high run queue lengths, which led me to conclude that the scheduler would very likely bring me to the ultimate offender. The scheduler can be traced with the perf command, which has many options. Since the issue had been ongoing for several minutes, a trace of a few tens of seconds would suffice. Moreover, as I didn't know which scheduler event to focus on, I traced all of them:

[fgenceliyev@machine1 ~]$ sudo perf sched record -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 50.183 MB perf.data (132,856 samples) ]

The next step was to check whether what I suspected was true and I was going in the right direction (the output is intentionally shortened):

[fgenceliyev@machine1 ~]$ sudo perf sched latency --sort max

 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  MutationStage-1:(2)   |    240.607 ms |    13933 | avg:    0.083 ms | max:  416.618 ms | max at: 275770.668504 s
  C2 CompilerThre:(6)   |   9980.936 ms |     2010 | avg:    0.934 ms | max:  322.393 ms | max at: 275772.038103 s
  MutationStage-8:17873 |     50.710 ms |     1743 | avg:    0.248 ms | max:  310.998 ms | max at: 275771.587630 s
  MutationStage-5:17870 |     62.694 ms |     2213 | avg:    0.347 ms | max:  310.980 ms | max at: 275771.587612 s
  Thread-17:19000       |     49.637 ms |      783 | avg:    0.494 ms | max:  310.795 ms | max at: 275771.587998 s
  vmtoolsd:1972         |     17.737 ms |       73 | avg:    4.401 ms | max:  287.086 ms | max at: 275769.602360 s
  ................................................................................................................
  ................................................................................................................
  ................................................................................................................

No comments here: huge delays! There are two pieces of good news, though: a) the output has a timestamp that shows when the maximum delay happened for a particular process (or thread), and b) it doesn't matter which process from the output we chase to drill down further, since (whatever it is) all of them suffer from the same problem.

At this point we know the delay is there, but we don't know where it occurs. In other words, we now need to split the delay into its parts and see where exactly most of the time is spent (we can take an arbitrary process to measure).

Let's take the very first timestamp (275770.668504) for analysis and see how the time is distributed. This can be achieved with the timehist option of perf sched, which gives a per-event breakdown.

[fgenceliyev@machine1 ~]$ sudo perf sched timehist -f
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  275767.383658 [0001]  cluster1-nio-wo[18972/18944]        0.000      0.000      0.000 
  275767.383663 [0002]  <idle>                              0.000      0.000      0.000 
  275767.383664 [0000]  <idle>                              0.000      0.000      0.000 
  275767.383670 [0002]  MutationStage-5[17870/17693]        0.000      0.000      0.006 
  275767.383670 [0000]  MutationStage-1[17803/17693]        0.000      0.000      0.006 
  275767.383696 [0003]  cluster1-nio-wo[18970/18944]        0.000      0.000      0.000 
  275767.383697 [0001]  Thread-6[18989/18944]               0.000      0.000      0.039 
  275767.383699 [0005]  <idle>                              0.000      0.000      0.000
  .....................................................................................
  .....................................................................................
  .....................................................................................
  275770.668504 [0004]  cluster1-nio-wo[18971/18944]      416.371      0.277      0.254
  .....................................................................................
  .....................................................................................
  .....................................................................................

We are interested in two columns: wait time and sch delay. The first is the time the task was waiting to be woken up, and the second is the time from the wakeup to the moment the task is actually scheduled in. The latter is the scheduler latency, i.e. the time a task spends in the Runnable state before it is switched to Running.

What we can see in the row at timestamp 275770.668504 is that the scheduler latency is high, 0.277 msec, but what is even worse is that the task had spent 416 msec waiting before it was woken up. Such a situation is a little tricky, as it resembles the chicken-and-egg problem, since either can cause the other: the scheduler's slow decisions could stem from the same causes that made the task wait so long and, conversely, high scheduler latencies can cause significant delays before a task receives its wake-up signal. Without evidence we cannot declare either one the cause. If we don't have enough evidence (and sometimes we don't :)) ), we should at least be able to say what is more and what is less likely.

Since we don't know which caused which, we should attack each of them separately, as if the other issue did not exist at all. Thus, the first question is: "What affects the latency of scheduler decisions?"

Recall that the OS scheduler is itself kernel code and needs CPU time to do its job. If the CPUs are saturated (for example, by heavy I/O, or by moderate I/O against a slow device), the scheduler's behavior can be affected and, hence, its latency may rise.

This is a three-node Apache Cassandra cluster. For each type of task (such as handling client requests, updating the tables, etc.) the database maintains a pool of threads, each of which performs its share of the work. We need to see what those threads are doing; that should tell us which actions burn the CPU (or which are slow). To see the breakdown by threads we will use psn (Tanel Poder's process snapper).
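
The snapshot below was taken with psn from the 0x.tools package; the invocation was roughly the following (a sketch: the pid placeholder is hypothetical, and -G selects which columns to sample, per the tool's documentation):

[fgenceliyev@machine1 ~]$ sudo psn -p <cassandra_pid> -G comm2,state,wchan,kstack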

=== Active Threads =============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================

 samples | avg_threads | comm2             | state                  | wchan                        | kstack                                                                                                                                                                                                                                                                                                                                                                                                                     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      32 |        4.00 | (Native-Transpor) | Running (ON CPU)       | 0                            | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
      20 |        2.50 | (MutationStage-1) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
      19 |        2.38 | (Native-Transpor) | Running (ON CPU)       | futex_wait_queue_me          | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
      18 |        2.25 | (Native-Transpor) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
      16 |        2.00 | (java)            | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       8 |        1.00 | (MutationStage-1) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       7 |        0.88 | (Native-Transpor) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       6 |        0.75 | (Native-Transpor) | Running (ON CPU)       | 0                            | do_futex()->futex_wait()->hrtimer_cancel()->hrtimer_try_to_cancel()                                                                                                                                                                                                                                                                                                                                                        
       5 |        0.62 | (MutationStage-6) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       5 |        0.62 | (epollEventLoopG) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       4 |        0.50 | (MutationStage-9) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       4 |        0.50 | (Native-Transpor) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       4 |        0.50 | (java)            | Running (ON CPU)       | 0                            | do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                                                            
       3 |        0.38 | (MutationStage-1) | Running (ON CPU)       | 0                            | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
       3 |        0.38 | (Native-Transpor) | Disk (Uninterruptible) | __lock_page                  | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       2 |        0.25 | (MutationStage-1) | Disk (Uninterruptible) | __lock_page                  | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       2 |        0.25 | (MutationStage-2) | Running (ON CPU)       | futex_wait_queue_me          | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
       2 |        0.25 | (MutationStage-7) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       2 |        0.25 | (MutationStage-8) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       2 |        0.25 | (Native-Transpor) | Running (ON CPU)       | 0                            | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       2 |        0.25 | (Native-Transpor) | Running (ON CPU)       | 0                            | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       1 |        0.12 | (MessagingServic) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
  ................................................................................................................
  ................................................................................................................
  ................................................................................................................

The comm2 column shows a thread from the thread pool, and the state column shows what that thread was doing at the time of sampling. It is clear that a number of threads are "ON CPU", but what also catches the eye is that quite a portion of them are waiting for disk I/O to complete (the Disk (Uninterruptible) state). Here I am showing just one sample and not going into the details of the values in the wchan and kstack columns. Sampling the threads several more times shows the same behavior. For a database like Cassandra (if configured properly), we should not observe many threads spending their time on I/O.

This leads us to suspect that the storage subsystem is not keeping up. The iostat output comes to the rescue (for the following two outputs I only have screenshots saved aside, so the figures are quoted from them):
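
The figures quoted below were read from extended per-device statistics, i.e. from an invocation along these lines (a sketch; the interval is an assumption: -d shows the device report, -m reports megabytes, -x adds extended statistics such as await and avgqu-sz):

[fgenceliyev@machine1 ~]$ iostat -dmx 1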

It takes 53 msec to write 4.77 MB of data! Moreover, avgqu-sz is low, so there is no queueing on the host side. The next contact is the storage team.

Let's also look at the bcc-tools output, just for diversity. The last column is the latency in milliseconds and the third column from the end is the amount of data (in bytes) written: 512 KB gets written in 0.4 seconds!
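
The output in question came from one of the bcc block-I/O tracing tools, most likely biosnoop, which prints one line per block request together with its size and completion latency (an assumption, since only a screenshot survived; the path below is the typical packaged location):

[fgenceliyev@machine1 ~]$ sudo /usr/share/bcc/tools/biosnoop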

The whole investigation could have been much shorter had we simply run iostat and analyzed the I/O latency right away. But that would have been just a matter of luck. Even if the path is long, taking the right direction from the very beginning will definitely lead you to the cause.

That is it for now! Hope you enjoyed it!

A potential method for determining the join method without explain plan

I will not pretend that the idea presented in this post addresses a wide range of troubleshooting scenarios, but in some cases it may come in handy. For example, imagine a situation where you cannot log in to the database because a long-running query has made it unresponsive, yet the OS is running fine and can be worked with.

The simple and not-yet-mature approach presented here should help one make a judgment about the join method a query is likely performing. At the very least, it gives some (though incomplete) grounds to reason about it.

All other conditions being equal, the Nested Loop join is the method that executes the highest number of iterations to fetch the requested tuples. The other two methods, Hash Join and Sort-Merge Join, involve fewer iterations and, typically, fewer loops.

The scenario with the assumptions above is not a rare one. The question, however, may be: "Once you start suspecting that it is most probably the Nested Loop join causing the hang, how does that help?" At the very least, one can ask the developers to fix the query by changing the join method, which should resolve the issue. Since (as assumed) the database is unresponsive, one cannot peek at the explain plan and tell for sure. What can help, though, is "expressing the join method via loops".

To do this, let’s step aside a little.

When the compiler converts a loop in high-level code (while/for) into assembly, it maps it to branch instructions. A branch instruction updates the contents of the Program Counter (PC; known as the Instruction Pointer on x86) register with the branch target address (BTA). On the next cycle the CPU fetches the instruction at the address now held in the PC (i.e. the instruction the branch points to) and executes it. To illustrate, here is an assembly example for ARM processors:

.global main

main:
        mov     r0, #0     /* setting up initial variable a */
loop:   cmp     r0, #4     /* checking if a==4 */
        beq     end        /* proceeding to the end if a==4 */
        add     r0, r0, #1 /* increasing a by 1 if the jump to the end did not occur */
        b loop             /* repeating the loop */
end:
        bx lr              /* THE END */

Here, b is a branch instruction targeting the label loop, and the label loop points to the cmp (compare) instruction. When the CPU reaches the b loop instruction, the PC is updated with the value of the loop label, which is nothing but the address of the cmp instruction. On the next cycle, fetching from the PC returns the address of cmp, so that instruction is fetched and executed. In this way the branch instruction "makes a jump", and this is how a loop iterates.

Thus, a high number of executed branch instructions can be an indicator of excessive iteration, in particular of a Nested Loop join.

The next step is finding an OS-level statistic that shows the number of branch instructions executed in real time. This is where the perf tool comes into play.

The perf tool is a performance profiler that can display various hardware and software statistics and counters. It can also be attached to a running process. For example, as follows:

# perf stat -p 4539
       1920.159821 task-clock                #    0.991 CPUs utilized          
                13 context-switches          #    0.007 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
               258 page-faults               #    0.134 K/sec                  
     5,649,595,479 cycles                    #    2.942 GHz                     [83.43%]
     1,808,339,931 stalled-cycles-frontend   #   32.01% frontend cycles idle    [83.54%]
     1,171,884,577 stalled-cycles-backend    #   20.74% backend  cycles idle    [66.77%]
     8,625,207,199 instructions              #    1.53  insns per cycle        
                                             #    0.21  stalled cycles per insn [83.51%]
     1,488,797,176 branches                  #  775.351 M/sec                   [82.58%]
        53,395,139 branch-misses             #    3.59% of all branches         [83.78%]

       1.936842598 seconds time elapsed

The command accumulates statistics until it is cancelled (CTRL+C). Here perf shows that, by that time, 1,488,797,176 branch instructions had been executed, of which 53,395,139 the branch predictor failed to predict. The former is the main metric of interest. If large deltas are observed in the branches metric, the code is executing a high number of iterations, which in turn directs us to suspect that (probably) a Nested Loop join is running intensively. Sampling this metric over time and seeing that the number is unreasonably high is a tip that the code needs to be checked for a loop iterating over and over again.
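
To observe the deltas rather than a single cumulative figure, perf can print the counters at a fixed interval while attached to the process (a sketch; the pid is the same example one as above, and -I 1000 prints the counters every 1000 ms):

# perf stat -e branches,branch-misses -I 1000 -p 4539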

Side note: the Nested Loop join is not the only code path that can cause this. It can also be, for example, latch contention, since retry logic is an essential part of pretty much any latch implementation. Another example is an index scan, where the code iteratively walks the index to find a key (and which, by the way, can itself be part of a Nested Loop join). The approach presented should be applicable to any RDBMS.

In the next post I will reproduce a highly iterative Nested Loop join and analyze the metric mentioned above. This will give better insight into how to refine the idea briefly presented here.

Feeding intelligence into synchronization primitives

This is a brief overview of my recent paper in which I reason about applying machine learning to synchronization methods, in particular to spinlock protocols. Research around spinlocks is quite extensive, but machine learning techniques there are rare. To my knowledge, there has been just one work so far dealing with the integration of machine learning into spinlocks: Smartlocks: Self-aware Synchronization through Lock Acquisition Scheduling, developed at MIT in 2009 by great researchers. The main idea of that work is to use application performance metrics, measured via an API that sends heartbeats to a machine learning engine. The engine then works out an optimal policy which the threads use to organize their behavior intelligently.

My focus is on hybrid methods where threads must decide whether to spin or park. This decision is hard to make because of the unpredictability of the system load (which is crucial for application performance) and of the number of threads that may contend for the lock. Perhaps a way out is to let the threads learn and decide for themselves rather than hard-coding the choice. But they should make the decision so that it is good both for themselves (acquiring the lock as fast as possible while burning as few CPU cycles as possible) and for the system (not making it too oversubscribed). This sounds just like Russell Crowe, starring as John Nash in "A Beautiful Mind", exclaiming "The best for the group comes when everyone in the group does what's best for himself and the group" :))). The feedback, which can for example be modeled as wasted CPU cycles and/or scheduler load, can be used to train the threads: a thread sleeps for some amount of time; if the lock passes it by, the thread is punished, otherwise it is rewarded. All the details are in the paper.

Previously, I suggested using machine learning techniques to eliminate some of the issues associated with spinlock protocols that also employ blocking. But the model I presented left much room for improvement. This time I have tried to generalize it further and extend it to the spin-then-park method as well (where a thread does not just block or just spin, but spins for a while and then parks itself). Moreover, I have refined the model itself. The full text can be found here. Hope you will enjoy it.

A New Paper on Spinlocks meeting Machine Learning

This is not a performance troubleshooting post, but it is related in some sense. I have been working on a paper in which I suggest applying the reinforcement learning branch of machine learning to the design of a spin-then-block strategy. The default behavior of latches and mutexes in recent versions of Oracle is, in fact, one implementation of the spin-then-block method. The primary idea of the paper is to have threads learn in which cases they should prefer sleeping and in which spinning, with the objective of minimizing the cost associated with lock acquisition. The paper is out now. Unfortunately, I could not make time to set up experiments to test it; however, the paper develops a self-tuning, machine-learning-based framework that can wrap the spin-then-block method. Obviously, some challenges remain. The journal is indexed in scientific databases such as Scopus and Web of Science (ESCI). The paper is available here. Hope you will find it interesting.

Intelligent Database Management Systems

Machine Learning (ML) has been around for decades and has plenty of applications. The idea is to give an agent the ability to learn by itself. We do not program all possible if-then-else cases; instead we either provide samples and the entity learns how to behave from them (supervised learning), or drop the entity into an environment and punish or reward it based on its behavior, so that it becomes smarter the more often it hits its head against the wall (reinforcement learning), or, lastly, give it a collection of data and let it figure out how to classify it or split it into groups sharing similar properties (unsupervised learning). Take a look at this to see how a robot learns to flip pancakes after 50 trials. It does impress.

Database Management Systems (DBMS) have several modules that are candidates for integration with ML; the query optimizer is one example. A query may have multiple execution plans, and the job of the optimizer is to pick the plan with the least cost. Cost may combine many metrics, such as the number of I/Os or the number of rows processed (for an in-memory DBMS). We can treat the optimizer as the agent, the system as the environment, and the cost as the basis for the reward: the lower the cost, the more we reward the optimizer. So, even if a plan with a particular cost was picked this time, the optimizer will try to improve on it the next time. In this way the optimizer can learn from its own behavior rather than being programmed for most of the possible cases.

Easier said than done. In theory this would work; in practice it needs a careful design and lots of experimentation. The hardest part would be to map the optimizer's behavior onto the reinforcement learning paradigm, as environments are numerous and very different.

Tim Kraska, an associate professor of computer science at MIT, briefly speaks about how DBAs could be relieved of index management if machine learning were used. Once the data distribution changes, it may be worth reconsidering the indexes; an index management module can learn to pick the columns to be indexed as the data in the database changes over time.

Recently, a group of researchers and students at Carnegie Mellon University developed a tool named OtterTune that aims at simplifying DBMS deployment. The tool has a repository that keeps data from historical tuning sessions; this experience is then used by the ML engine to produce appropriate recommendations for a new configuration depending on the application objective. In the corresponding paper, the authors conduct experiments on three DBMSs: MySQL, Postgres and Vector.

Oracle introduced some adaptive features in 12c, which is a step towards intelligent tuning. One of them is the adaptive execution plan feature which, briefly put, works as follows. An SQL statement starts executing with an initially generated plan, while candidate plans are kept in the query's cursor. Say the initial choice for a join in the query is a Nested Loop (NL). When execution reaches this join, rows are buffered and their count is compared with the initial estimates; if the comparison crosses a certain threshold, the NL is replaced with a Hash Join, which is the smarter decision in that case. Thus the join method switches on the fly, at run time, and execution takes feedback into account. Subsequent executions of the same SQL statement then use this plan from then on.

Database Management Systems leave a lot of room for the application of machine learning methods. Candidates include query optimization, database recovery and checkpointing, memory management and instance tuning. Along with database knowledge, ML skills might be a plus for future DBAs.