Application-oblivious diagnosis – The Database Chronicles

When diagnosing a performance problem, interpreting the flow inside of the application is as important as other aspects of the whole process, such as tools and methodology. One of the key points is to figure out what the application threads are supposed to do and how these threads communicate. Once this is established, the rest is a matter of ensuring that the figures are interpreted as they were designed to be.

The following situation was not very critical from business perspective but still was quite an interesting one. Troubleshooting scenarios like this require taking the right direction from the very beginning in the first place and asking the right questions along the way. Finding answers to these questions complements the job and typically requires a different skill-set.

The issue was related to the “hbase shell” command which simply enters the shell so that one can start submitting commands to the database. It took the shell several seconds to return.

-bash-4.2$ hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.3.7.1.7.1000-141
Took 0.0012 seconds                                                                                                                                    
hbase:001:0>

HBase says it takes 0.012 seconds to complete which is not true (not yet sure what this figure represents but apparently it measures the time the main thread assigns the job to one of its clones – details below). It was interesting to see how much time it approximately takes to execute:

-bash-4.2$ time hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.3.7.1.7.1000-141
Took 0.0013 seconds
hbase:001:0> exit

real 0m13.464s
user 0m19.437s
sys 0m1.160s

The real figure represents the (almost) actual execution time, and the user represents time spent in the user space. The output (highlighted lines) tells us that it is purely application related since the majority of the time is spent in the user-space. This is why I have to focus on the application (behavior !!!) rather than analyze the execution in the kernel layer. The first question that comes to the mind is this: “How come the real execution time is less than the time in the user-space?”. Apparently, more than one thread is involved in the command execution and the time is simply summed up across the threads which is why it is greater than the actual time.

To analyze the application behavior one can use the strace tool. This is a widely known one, although not the best one to measure performance. Still, it provides insights into application behavior since it traces the syscalls which application issues to interact with OS. strace has also flags which dumps the thread syscalls, if any. I will use the –f (for threads), – r (for timestamp) and –T(for duration) flags and rerun the “hbase shell” command this time tracing it with strace and dumping the result to a file for further analysis.

Here is the awk script which sorts the syscalls by the duration:

bash-4.2$ awk '{ gsub("<", "", $NF) ; print $NF,$0 }' stracenew.out | sort -nr | head 
9.437582> 26999 0.000083 <… futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) 9.437582>
4.393135> 27570 0.000045 <… futex resumed>) = 0 4.393135>
3.998768> 27570 0.000028 <… futex resumed>) = 0 3.998768>
3.065013> 27570 0.000055 <… futex resumed>) = 0 3.065013>
2.555355> 27360 0.000321 <… futex resumed>) = 0 2.555355>
2.554154> 27359 0.000005 <… futex resumed>) = 0 2.554154>
2.551643> 27358 0.000315 <… futex resumed>) = 0 2.551643>
2.550368> 27357 0.000005 <… futex resumed>) = 0 2.550368>
2.546680> 27356 0.000009 <… futex resumed>) = 0 2.546680>
2.546620> 27355 0.000042 <… futex resumed>) = 0 2.546620>

The second column is the process/thread id. This output indicates two things: 1) it is indeed a multithreaded process and 2) slowness is the futex-related syscalls. This “futex resumed” shows that a futex syscall is still unfinished.

Futexes are synchronization mechanisms and serve the purpose of the inter-process communication: one process starts sleeping on the futex until a certain task is completed which is when the sleeping process wakes up and continues execution. That is why I now have to focus on the futex related syscalls to analyze which I will check what relevant syscalls the main process issues (the latter is this 26999 from the output which is easy to confirm from the dump):

bash-4.2$ grep 26999 stracenew | grep futex
26999 0.000049 futex(0x7f78bff870d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000008>
26999 0.000035 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000083 <… futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <9.437582>
26999 0.000025 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000009 <… futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <0.105477>
26999 0.000009 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
27387 0.000014 futex(0x7f78b80eb554, FUTEX_WAIT_BITSET_PRIVATE, 3605, {tv_sec=1403988, tv_nsec=482699946}, 0xffffffff
26999 0.000007 <… futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <1.962090>
26999 0.000014 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000024 <… futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <0.908004>
26999 0.000022 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000010 <… futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <0.038960>
26999 0.000009 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL

The thread 27387 is not what I am looking for. In all other cases futex syscalls complete fast but this one “futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL <unfinished…>” not yet finished. When the second parameter of the futex syscall is FUTEX_WAIT, it will check if the value at the address 0x7f78c05ce9d0 still contains the value of 27352 in which case it blocks until (if the fourth parameter is NULL) the FUTEX_WAKE operation is done on it by another futex syscall. This indicates one interesting point: “Apparently, the FUTEX_WAKE futex syscall completes late, otherwise, the main thread which is the issuer of the FUTEX_WAIT would have completed its job and return“.

The game-changing question now becomes this: “Which of the threads is responsible for the FUTEX_WAKE futex syscall?” If this is found, then what would remain is to analyze what this particular thread is busy with which causes it to issue the the futex syscall with FUTEX_WAKE late.

So, the futex word is 27352 and whoever the FUTEX_WAKE issuer is, it is this thread which will updated the futex word at the 0x7f78c05ce9d0 address so that the FUTEX_WAIT issue can detect it and resume… and 27352 very much resembles a thread id. If it is indeed so, then there has to be the clone/fork syscall issued by the main process 26999:

bash-4.2$ grep 27352 stracenew | grep 26999 | grep clone 
26999      0.000039 clone(child_stack=0x7f78c05cdfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f78c05ce9d0, tls=0x7f78c05ce700, child_tidptr=0x7f78c05ce9d0) = 27352 <0.000058>

Gotten. The clone syscall returns the thread id.

If the main process waits for the futex word to become 27352 and this is the id of one of its child process, then the main one must have assigned the job to exactly this child and the child is responsible for issuing the FUTEX_WAKE. Thus, it is this thread 27352 which comes back late which can happen for one of the either reasons: it issues excessive number of syscalls or has a couple of syscalls that take the most amount of time.

Let’s check out the first case. The script below will group by the syscall name and print number of occurrences for each:

awk '{ syscall_name = substr($3, 1, index($3, "(") - 1); syscall_count[syscall_name]++ } END { for (syscall_name in syscall_count) order[syscall_count[syscall_name], syscall_name] = syscall_name; PROCINFO["sorted_in"] = "@ind_num_desc"; i = 0; for (j in order) { print order[j], syscall_count[order[j]]; i++; if (i == 10) break; } }' stracenew
stat 330179
futex 49207
read 25507
lseek 23337
lstat 17729
rt_sigprocmask 6369
close 5688
mprotect 5046
mmap 2425

The thread 27352 issues lots of stat syscalls for some reason. The total number of the syscalls in the dump is 425896. So, 330179/425896 ~ 78% of the job is on this guy 27352. These are the sample calls:

27352      0.000057 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/resources.jar", {st_mode=S_IFREG|0644, st_size=3505346, ...}) = 0 <0.000011>
27352      0.000052 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/rt.jar", {st_mode=S_IFREG|0644, st_size=66339559, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/sunrsasign.jar", 0x7f78c05cd920) = -1 ENOENT (No such file or directory) <0.000010>
27352      0.000044 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/jsse.jar", {st_mode=S_IFREG|0644, st_size=660664, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/jce.jar", {st_mode=S_IFREG|0644, st_size=115651, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/charsets.jar", {st_mode=S_IFREG|0644, st_size=3135615, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/jfr.jar", {st_mode=S_IFREG|0644, st_size=560713, ...}) = 0 <0.000011>

The stat syscall returns attributes of a file. This does not have to involve an I/O.

The next question is “Why the app issues so many syscalls to complete a single command?“

The fix is another story. The issue is diagnosed !!!

Leave a Reply Cancel reply