Application-oblivious diagnosis

When diagnosing a performance problem, interpreting what happens inside the application is as important as the tools and the methodology. One of the key points is to figure out what the application threads are supposed to do and how they communicate. Once this is established, the rest is a matter of reading the figures the way they were designed to be read.

The following situation was not very critical from a business perspective, but it was still quite an interesting one. Troubleshooting scenarios like this require taking the right direction from the very beginning and asking the right questions along the way. Finding answers to these questions completes the job and typically requires a different skill set.

The issue was related to the “hbase shell” command, which simply enters the shell so that one can start submitting commands to the database. It took the shell several seconds to return.

-bash-4.2$ hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.3.7.1.7.1000-141
Took 0.0012 seconds                                                                                                                                    
hbase:001:0>

HBase says it took 0.0012 seconds, which is not true (I am not yet sure what this figure represents, but apparently it measures the time the main thread takes to hand the job off to one of its clones – details below). It was interesting to see how much time the command approximately takes to execute:

-bash-4.2$ time hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.3.7.1.7.1000-141
Took 0.0013 seconds
hbase:001:0> exit

real 0m13.464s
user 0m19.437s
sys 0m1.160s

The real figure represents the (almost) actual wall-clock execution time, and user represents the time spent in user space. The output tells us that the problem is purely application related, since the majority of the time is spent in user space. This is why I have to focus on the application behavior rather than analyze the execution in the kernel layer. The first question that comes to mind is this: “How come the real execution time is less than the time in user space?”. Apparently, more than one thread is involved in the command execution, and the user time is simply summed up across the threads, which is why it is greater than the wall-clock time.

To analyze the application behavior one can use the strace tool. It is widely known, although not the best one for measuring performance. Still, it provides insight into application behavior, since it traces the syscalls the application issues to interact with the OS. strace also has flags which dump the syscalls of the threads, if any. I will use the -f (follow threads), -r (relative timestamp) and -T (syscall duration) flags and rerun the “hbase shell” command, this time tracing it with strace and dumping the result to a file for further analysis.
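
The invocation was roughly as follows (a sketch; -o writes the trace to a file for later analysis):

strace -f -r -T -o stracenew.out hbase shell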

Here is the awk script which sorts the syscalls by the duration:

bash-4.2$ awk '{ gsub("<", "", $NF) ; print $NF,$0 }' stracenew.out | sort -nr | head 
9.437582> 26999 0.000083 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) 9.437582>
4.393135> 27570 0.000045 < futex resumed>) = 0 4.393135>
3.998768> 27570 0.000028 < futex resumed>) = 0 3.998768>
3.065013> 27570 0.000055 < futex resumed>) = 0 3.065013>
2.555355> 27360 0.000321 < futex resumed>) = 0 2.555355>
2.554154> 27359 0.000005 < futex resumed>) = 0 2.554154>
2.551643> 27358 0.000315 < futex resumed>) = 0 2.551643>
2.550368> 27357 0.000005 < futex resumed>) = 0 2.550368>
2.546680> 27356 0.000009 < futex resumed>) = 0 2.546680>
2.546620> 27355 0.000042 < futex resumed>) = 0 2.546620>

The second column is the process/thread id. This output indicates two things: 1) it is indeed a multithreaded process, and 2) the slowness is in the futex-related syscalls. The “futex resumed” notation is how strace reports a previously unfinished futex syscall that has resumed.

Futexes are a synchronization mechanism serving the purpose of inter-process (and inter-thread) communication: one process sleeps on the futex until a certain task is completed, at which point the sleeping process is woken up and continues execution. That is why I now focus on the futex-related syscalls and check which relevant ones the main process issues (the main process is 26999 from the output above, which is easy to confirm from the dump):

bash-4.2$ grep 26999 stracenew | grep futex
26999 0.000049 futex(0x7f78bff870d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000008>
26999 0.000035 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000083 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <9.437582>
26999 0.000025 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000009 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <0.105477>
26999 0.000009 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
27387 0.000014 futex(0x7f78b80eb554, FUTEX_WAIT_BITSET_PRIVATE, 3605, {tv_sec=1403988, tv_nsec=482699946}, 0xffffffff
26999 0.000007 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <1.962090>
26999 0.000014 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000024 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <0.908004>
26999 0.000022 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL
26999 0.000010 < futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <0.038960>
26999 0.000009 futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL

The thread 27387 is not what I am looking for. In all other cases the futex syscalls complete fast, but this one, “futex(0x7f78c05ce9d0, FUTEX_WAIT, 27352, NULL <unfinished…>”, has not finished yet. When the second parameter of the futex syscall is FUTEX_WAIT, the kernel checks whether the address 0x7f78c05ce9d0 still contains the value 27352, in which case it blocks until (since the fourth parameter, the timeout, is NULL) a FUTEX_WAKE operation is performed on it by another futex syscall. This indicates one interesting point: apparently, the FUTEX_WAKE counterpart arrives late; otherwise, the main thread, the issuer of the FUTEX_WAIT, would have completed its job and returned.

The game-changing question now becomes this: “Which of the threads is responsible for the FUTEX_WAKE futex syscall?” If this is found, then what remains is to analyze what this particular thread is busy with that causes it to issue the FUTEX_WAKE so late.

So, the futex word is 27352, and whoever the FUTEX_WAKE issuer is, it is the thread which will update the futex word at the 0x7f78c05ce9d0 address so that the FUTEX_WAIT issuer can detect it and resume… and 27352 very much resembles a thread id. If that is indeed so, then there has to be a clone/fork syscall issued by the main process 26999:

bash-4.2$ grep 27352 stracenew | grep 26999 | grep clone 
26999      0.000039 clone(child_stack=0x7f78c05cdfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f78c05ce9d0, tls=0x7f78c05ce700, child_tidptr=0x7f78c05ce9d0) = 27352 <0.000058>

Got it. The clone syscall returns the thread id.

If the main process waits for the futex word to become 27352 and this is the id of one of its child threads, then the main one must have assigned the job to exactly this child, and the child is responsible for issuing the FUTEX_WAKE (note the CLONE_CHILD_CLEARTID flag above: on thread exit the kernel clears the futex word and wakes the waiters). Thus, it is this thread 27352 which comes back late, which can happen for one of two reasons: it issues an excessive number of syscalls, or a few of its syscalls take most of the time.

Let’s check out the first case. The script below groups by the syscall name and prints the number of occurrences of each:

awk '{ syscall_name = substr($3, 1, index($3, "(") - 1); syscall_count[syscall_name]++ } END { for (syscall_name in syscall_count) order[syscall_count[syscall_name], syscall_name] = syscall_name; PROCINFO["sorted_in"] = "@ind_num_desc"; i = 0; for (j in order) { print order[j], syscall_count[order[j]]; i++; if (i == 10) break; } }' stracenew
stat 330179
futex 49207
read 25507
lseek 23337
lstat 17729
rt_sigprocmask 6369
close 5688
mprotect 5046
mmap 2425
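
Note that the one-liner above aggregates across the whole dump; to attribute the counts to a single thread, one can filter by the thread id first. A sketch over the same dump (plain awk):

awk '$1 == 27352 { n = substr($3, 1, index($3, "(") - 1); if (n != "") c[n]++ } END { for (s in c) print c[s], s }' stracenew | sort -rn | head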

The thread 27352 issues lots of stat syscalls for some reason. The dump contains 425896 syscalls in total, so these stat calls alone account for 330179/425896 ≈ 78% of them. Here are some sample calls:

27352      0.000057 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/resources.jar", {st_mode=S_IFREG|0644, st_size=3505346, ...}) = 0 <0.000011>
27352      0.000052 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/rt.jar", {st_mode=S_IFREG|0644, st_size=66339559, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/sunrsasign.jar", 0x7f78c05cd920) = -1 ENOENT (No such file or directory) <0.000010>
27352      0.000044 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/jsse.jar", {st_mode=S_IFREG|0644, st_size=660664, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/jce.jar", {st_mode=S_IFREG|0644, st_size=115651, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/charsets.jar", {st_mode=S_IFREG|0644, st_size=3135615, ...}) = 0 <0.000011>
27352      0.000046 stat("/usr/java/jdk1.8.0_181-cloudera/jre/lib/jfr.jar", {st_mode=S_IFREG|0644, st_size=560713, ...}) = 0 <0.000011>

The stat syscall returns the attributes of a file. It does not necessarily involve disk I/O (the metadata is typically served from the kernel caches), but hundreds of thousands of such calls add up.

The next question is: “Why does the app issue so many syscalls to complete a single command?”

The fix is another story. The issue is diagnosed!

Diving deeper into the LWLock in PostgreSQL

Lightweight locks (LWLocks) are a type of interprocess lock used by PostgreSQL to coordinate simultaneous access of multiple processes to a shared memory region. There are various reasons why this lock can become a problem in a database. I will first walk through a short demo, and then explore the nature of the LWLock in PostgreSQL with reference to its source code and the underlying OS primitives.

As a first step, I will generate some load on the database using pgbench. While not extremely flexible, this tool is adequate for testing some basics. The provided inputs are as follows.

[postgres@centos8 ~]$ pgbench -c 10 -j 10 -t 100000 -S tpcc
pgbench (15.5)
starting vacuum...end.

It spawns 10 (-j) worker threads to drive 10 (-c) concurrent client sessions, each of which runs 100,000 read-only transactions as fast as possible (this is the default; the rate is configurable as well). Visually, on the active session chart, it appears as follows:

This is certainly not the behavior one would expect from a well-performing database with reasonable response times. The dark red LWLock band prevails, and we certainly need to bring it down. Typically, this LWLock is accompanied by I/O, as is the case here. However, this is not always the scenario, as a few other causes exist. Let’s assume we have no information about the database and try to figure out what is going on.

The pgsentinel extension, when loaded, samples the active sessions every second (by default) and makes the information available through the pg_active_session_history view. This allows querying what the sessions were doing at a particular point in time or within a range, making it very useful when diagnosing performance issues:

[postgres@centos8 home]$ ./ash wait_event_type,wait_event "backend_type='client backend'" '2023-11-24 20:29:00' '2023-11-24 20:30:00'
--------------------------------------------------------------------------------------------------------
 wait_event_type |    wait_event    | samples | sessions 
-----------------+------------------+---------+----------
 CPU             | CPU              |     213 |       11
 LWLock          | BufferMapping    |     196 |       10
 IO              | DataFileRead     |      61 |       10
 IPC             | BufferIO         |      50 |       10
 IPC             | ExecuteGather    |       7 |        7
 IPC             | BgWorkerShutdown |       4 |        4
(6 rows)

This is a bash script which connects to the database and queries the pg_active_session_history view. The first argument (wait_event_type, wait_event) is the list of grouping columns, followed by the filter, followed by the time range we are interested in.
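
Under the hood this boils down to a plain aggregation over the view. A minimal sketch of the equivalent query (pgsentinel’s ash_time and pid column names assumed):

psql -d tpcc -c "
SELECT wait_event_type, wait_event,
       count(*)            AS samples,
       count(DISTINCT pid) AS sessions
  FROM pg_active_session_history
 WHERE backend_type = 'client backend'
   AND ash_time BETWEEN '2023-11-24 20:29:00' AND '2023-11-24 20:30:00'
 GROUP BY 1, 2
 ORDER BY samples DESC;"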

From the output (as well as the chart), it becomes evident that the LWLock is the culprit, with approximately 10 sessions waiting for BufferMapping. This wait event occurs when a process is waiting to associate a data page with a buffer in the shared buffers: once the necessary disk page has been identified, the process must find a buffer in the buffer pool to load the page into, and that mapping is what the processes are currently awaiting. Before we rerun the script with additional dimensions that may be helpful, let’s draw a few reasonable conclusions here.

If the process cannot find a free buffer in the buffer pool, then the most likely suspect is that the shared buffers area is full. The (misleading) path to take then would be something like “Well, if it is full, then let’s add some space.” Although feasible, that is rarely the right solution in practice; the correct approach is to ask, “What is polluting it?”. Well… definitely the queries :)). Let’s then find out what these queries look like (only the top 20 rows are returned by default).

[postgres@centos8 home]$ ./ash wait_event_type,wait_event,query "backend_type='client backend'" '2023-11-24 20:29:00' '2023-11-24 20:30:00'
--------------------------------------------------------------------------------------------------------
 wait_event_type |  wait_event   |                           query                           | samples | sessions 
-----------------+---------------+-----------------------------------------------------------+---------+----------
 LWLock          | BufferMapping | SELECT abalance FROM pgbench_accounts WHERE aid = 1679538 |       5 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 6508293 |       4 |        1
 LWLock          | BufferMapping | SELECT abalance FROM pgbench_accounts WHERE aid = 649929  |       4 |        1
 LWLock          | BufferMapping | SELECT abalance FROM pgbench_accounts WHERE aid = 6487070 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 5088721 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9953261 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1659220 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9321817 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9104089 |       3 |        1
 IO              | DataFileRead  | SELECT abalance FROM pgbench_accounts WHERE aid = 8129419 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1997345 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1172966 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 513858  |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9534579 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1393497 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 3166138 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 3735733 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 2971116 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 4772208 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 2960432 |       3 |        1
(20 rows)

All sessions are executing the same query but passing different literals. They look very innocent: a small point query which (apparently, because the column “aid” looks like some sort of ID) returns just a single row. Remember, we have no information about the database – nothing at all. That is, we cannot let names, IDs, etc. put us on the wrong path.

We now know what is polluting the buffer pool, so we have to see why this query pollutes it. The explain plan will show us the details of the query execution:

tpcc=# explain (analyze, buffers) SELECT abalance FROM pgbench_accounts WHERE aid = 1679538;
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..217018.43 rows=1 width=4) (actual time=2434.719..2434.813 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=12678 read=151257
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..216018.33 rows=1 width=4) (actual time=2166.930..2423.078 rows=0 loops=3)
         Filter: (aid = 1679538)
         Rows Removed by Filter: 3333333
         Buffers: shared hit=12678 read=151257
 Planning:
   Buffers: shared hit=41 read=1
 Planning Time: 6.413 ms
 Execution Time: 2434.930 ms
(12 rows)

The query reads 151257 pages from disk to return a single row (rows=1). Luckily, another 12678 are found in the shared buffers, but still, some 150 thousand pages is way too many to spit out a single row. The other interesting aspect here is the line “Rows Removed by Filter: 3333333”. In actuality, the database removed many more rows than that: 9999999, since the plan reports the per-process average and three processes did the scanning (loops=3: the leader plus two parallel workers). The full explanation is in another LinkedIn post of mine :)).

So, the query reads lots of pages, out of which only one holds the necessary row. This is what actually pollutes the buffer pool. Adding an index on the filtering column will make it much better: the query will then read only a couple of pages, a few from the index and one from the table (unless the table/index is bloated).
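
The fix itself is a single statement; a sketch (the index name is my own choice):

psql -d tpcc -c "CREATE INDEX pgbench_accounts_aid_idx ON pgbench_accounts (aid);"

This is how things looked after the column was indexed: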

[postgres@centos8 home]$ ./ash wait_event_type,wait_event "1=1" '2023-11-24 22:00:00' '2023-11-24 22:01:00'
--------------------------------------------------------------------------------------------------------
 wait_event_type |  wait_event  | samples | sessions 
-----------------+--------------+---------+----------
 Client          | ClientRead   |      36 |       10
 CPU             | CPU          |      16 |        8
 IO              | DataFileRead |      16 |        8
(3 rows)

The LWLock wait is gone.

Continued
Now that it is clear how to deal with LWLock-related waits (at least in a typical case), it is also important to understand the implementation details. This time, we’ll delve into how the lock is acquired (the release should be much simpler), and what actually counts as wait duration while the lock is being waited for. The current wait interface in PostgreSQL does not expose the duration of the wait. One of the first valid approaches is to examine the functions that execute when the process attempts to acquire the lock.

Let’s sample the stack trace with pstack and see what is going on. I sampled a single process many times in a loop.
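
The loop was something along these lines (a sketch; the interval and sample count are my choice):

for i in $(seq 1 100); do pstack 4391; sleep 0.1; done > pstack_samples.txt

Here is one representative sample: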

[root@centos8 ~]# pstack 4391
#0  0x00007fd95fd9cda6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fd95fd9ce98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00000000007815a2 in PGSemaphoreLock (sema=0x7fd819711eb8) at pg_sema.c:327
#3  0x00000000007fca6c in LWLockAcquire (lock=lock@entry=0x7fd819715580, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1315
#4  0x00000000007dbf8f in BufferAlloc (foundPtr=0x7fff9354e59b, strategy=0x1c7da98, blockNum=134785, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=0x1c3d8f8) at bufmgr.c:1320
#5  ReadBuffer_common (smgr=0x1c3d8f8, relpersistence=<optimized out>, forkNum=MAIN_FORKNUM, blockNum=134785, mode=RBM_NORMAL, strategy=0x1c7da98, hit=0x7fff9354e647) at bufmgr.c:873
#6  0x00000000007dc6f0 in ReadBufferExtended (reln=0x7fd96169d1f8, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=134785, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>) at ../../../../src/include/utils/rel.h:560
#7  0x0000000000549c66 in heapgetpage (sscan=sscan@entry=0x1c7cd18, page=page@entry=134785) at heapam.c:402
#8  0x000000000054a1f6 in heapgettup_pagemode (scan=scan@entry=0x1c7cd18, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1120
#9  0x000000000054b162 in heap_getnextslot (sscan=0x1c7cd18, direction=<optimized out>, slot=0x1c73270) at heapam.c:1352
#10 0x00000000006cff6b in table_scan_getnextslot (slot=0x1c73270, direction=ForwardScanDirection, sscan=<optimized out>) at ../../../src/include/access/tableam.h:1046
#11 SeqNext (node=node@entry=0x1c730d0) at nodeSeqscan.c:80
#12 0x00000000006a9c52 in ExecScanFetch (recheckMtd=0x6cfeb0 <SeqRecheck>, accessMtd=0x6cfee0 <SeqNext>, node=0x1c730d0) at execScan.c:132
#13 ExecScan (node=0x1c730d0, accessMtd=0x6cfee0 <SeqNext>, recheckMtd=0x6cfeb0 <SeqRecheck>) at execScan.c:198
#14 0x00000000006ba8fa in ExecProcNode (node=0x1c730d0) at ../../../src/include/executor/executor.h:262
#15 gather_getnext (gatherstate=0x1c72ee0) at nodeGather.c:295
#16 ExecGather (pstate=0x1c72ee0) at nodeGather.c:227

The relevant frames are #0 through #4. The BufferAlloc function (frame #4) is responsible for finding a buffer for the page; it calls LWLockAcquire (frame #3), which attempts to acquire the lock on the buffer. The top three frames (#0, #1, and #2) show that, failing to acquire the lock, the process sleeps on a POSIX semaphore, which glibc implements on top of a futex (hence the futex-related frames).

What cannot be seen from here is whether the process sleeps or spins, and if it sleeps, for how long. Tracing system calls can help here: their frequency and arguments can help us discern the nature of the waiting. Let’s strace it and see:

pread64(6, "\6\0\0\0(\257\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62840832) = 8192
pread64(6, "\6\0\0\0\330\311\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62849024) = 8192
futex(0x7fc549a66c38, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
pread64(6, "\6\0\0\0h\343\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62857216) = 8192
pread64(6, "\6\0\0\0 \375\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62865408) = 8192
pread64(6, "\6\0\0\0\360\26\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62873600) = 8192
pread64(6, "\6\0\0\0\3000\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62881792) = 8192
pread64(6, "\6\0\0\0\220J\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62889984) = 8192
pread64(6, "\6\0\0\0`d\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62898176) = 8192
pread64(6, "\6\0\0\0\30~\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62906368) = 8192
futex(0x7fc549a66c38, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fc549a67038, FUTEX_WAKE, 1)    = 1

This is a portion of the trace. The futex syscall does not appear all that frequently. Moreover, the last two futex calls have different arguments. The first of them puts the calling thread to sleep until it is awakened, because the fourth parameter, which determines the timeout of the sleep, is NULL. The second one wakes up a waiting thread; its third argument is 1, indicating that at most one waiter must be awakened.

We now know how long a thread sleeps when it fails to acquire the lock — until it is woken up. Additionally, we have determined that there is either no or very little spinning.

Now let’s peek at the source code. The call stack above shows that LWLockAcquire lives in the lwlock.c file (irrelevant lines have been replaced with dots):

bool
LWLockAcquire(LWLock *lock, LWLockMode mode)
{
	 ....................................................................
	for (;;)
	{
		bool		mustwait;

		/*
		 * Try to grab the lock the first time, we're not in the waitqueue
		 * yet/anymore.
		 */
		mustwait = LWLockAttemptLock(lock, mode);

		if (!mustwait)
		{
			LOG_LWDEBUG("LWLockAcquire", lock, "immediately acquired lock");
			break;				/* got the lock */
		}
		/* add to the queue */
		LWLockQueueSelf(lock, mode);

		/* we're now guaranteed to be woken up if necessary */
		mustwait = LWLockAttemptLock(lock, mode);

		/* ok, grabbed the lock the second time round, need to undo queueing */
		if (!mustwait)
		{
			LOG_LWDEBUG("LWLockAcquire", lock, "acquired, undoing queue");

			LWLockDequeueSelf(lock);
			break;
		}

		/*
		 * Wait until awakened.
		 *
		 * It is possible that we get awakened for a reason other than being
		 * signaled by LWLockRelease.  If so, loop back and wait again.  Once
		 * we've gotten the LWLock, re-increment the sema by the number of
		 * additional signals received.
		 */
		LOG_LWDEBUG("LWLockAcquire", lock, "waiting");

#ifdef LWLOCK_STATS
		lwstats->block_count++;
#endif

		LWLockReportWaitStart(lock);
		if (TRACE_POSTGRESQL_LWLOCK_WAIT_START_ENABLED())
			TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);

		for (;;)
		{
			PGSemaphoreLock(proc->sem);
			if (!proc->lwWaiting)
				break;
			extraWaits++;
		}

		/* Retrying, allow LWLockRelease to release waiters again. */
		pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_RELEASE_OK);
    ....................................................................
		LWLockReportWaitEnd();

		LOG_LWDEBUG("LWLockAcquire", lock, "awakened");

		/* Now loop back and try to acquire lock again. */
		result = false;
	}
	....................................................................
	return result;
}

This is how it works. First, the process attempts to acquire the lock with LWLockAttemptLock, and if unsuccessful, it queues itself up with LWLockQueueSelf. The next attempt follows, and if successful, the process dequeues itself with LWLockDequeueSelf and proceeds. If it fails again, it goes to sleep on a semaphore in PGSemaphoreLock. Right before the sleep, the process registers a wait by calling LWLockReportWaitStart, and it ends the wait by calling LWLockReportWaitEnd. The sleep ends when the process is awakened (the inner loop rechecks lwWaiting to filter out spurious wakeups, counting them in extraWaits), and the outer loop then retries the lock acquisition.

Therefore, LWLock acquisition is more about sleeping than spinning. The wait duration reported between LWLockReportWaitStart and LWLockReportWaitEnd is essentially the time the process spends sleeping on the semaphore; the couple of acquisition attempts around it are negligible in comparison.

More drill-down in future post…

Reasoning about OS Scheduler for Performance Degradation Analysis

It all started with a performance issue in one of our clusters. One of the components on the critical path turned out to be the OS scheduler, which I analyzed to find out whether it was the offender itself or just a symptom, with the global slowness caused by other factors in the stack. The whole story is very long and would take several pages to describe in full, so I will present only a short excerpt from the diagnosis journey. Don’t be surprised, then, that the story has (sort of) neither a beginning nor an end.

I decided to peek at the OS scheduler because vmstat reported high run queue lengths, and I concluded that the scheduler would quite likely lead me to the ultimate offender. Scheduler activity can be traced with the perf command, which has many options. Since the issue had been ongoing for several minutes, a few tens of seconds of tracing would suffice. Moreover, as I didn’t know which scheduler event to focus on, I traced all of them:

[fgenceliyev@machine1 ~]$ sudo perf sched record -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 50.183 MB perf.data (132,856 samples) ]

The next step was to check whether what I suspected was true and I was going in the right direction (the output is intentionally shortened):

[fgenceliyev@machine1 ~]$ sudo perf sched latency --sort max

 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  MutationStage-1:(2)   |    240.607 ms |    13933 | avg:    0.083 ms | max:  416.618 ms | max at: 275770.668504 s
  C2 CompilerThre:(6)   |   9980.936 ms |     2010 | avg:    0.934 ms | max:  322.393 ms | max at: 275772.038103 s
  MutationStage-8:17873 |     50.710 ms |     1743 | avg:    0.248 ms | max:  310.998 ms | max at: 275771.587630 s
  MutationStage-5:17870 |     62.694 ms |     2213 | avg:    0.347 ms | max:  310.980 ms | max at: 275771.587612 s
  Thread-17:19000       |     49.637 ms |      783 | avg:    0.494 ms | max:  310.795 ms | max at: 275771.587998 s
  vmtoolsd:1972         |     17.737 ms |       73 | avg:    4.401 ms | max:  287.086 ms | max at: 275769.602360 s
  ................................................................................................................
  ................................................................................................................
  ................................................................................................................

No comments here: huge delays! There are two pieces of good news though: a) the output has a timestamp that shows when the maximum delay happened for a particular process (or thread), and b) it doesn’t matter which process from the output we chase to drill down further, since all of them suffer from the same problem.

At this point, we know the delay is there, but we don’t know where it occurs. In other words, we need to split the delay into its parts and see where exactly most of the time is spent (we can take an arbitrary process to measure).

Let’s take the very first timestamp (275770.668504) for analysis and see how the time is distributed. This can be achieved with the timehist option of perf, which gives a breakdown of the events.

[fgenceliyev@machine1 ~]$ sudo perf sched timehist -f
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  275767.383658 [0001]  cluster1-nio-wo[18972/18944]        0.000      0.000      0.000 
  275767.383663 [0002]  <idle>                              0.000      0.000      0.000 
  275767.383664 [0000]  <idle>                              0.000      0.000      0.000 
  275767.383670 [0002]  MutationStage-5[17870/17693]        0.000      0.000      0.006 
  275767.383670 [0000]  MutationStage-1[17803/17693]        0.000      0.000      0.006 
  275767.383696 [0003]  cluster1-nio-wo[18970/18944]        0.000      0.000      0.000 
  275767.383697 [0001]  Thread-6[18989/18944]               0.000      0.000      0.039 
  275767.383699 [0005]  <idle>                              0.000      0.000      0.000
  .....................................................................................
  .....................................................................................
  .....................................................................................
  275770.668504 [0004]  cluster1-nio-wo[18971/18944]      416.371      0.277      0.254
  .....................................................................................
  .....................................................................................
  .....................................................................................

We are interested in two columns: wait time and sch delay. The first is the time a task spent waiting to be woken up, and the second is the time from the moment the task is woken up (made runnable) to the moment it is actually switched in, i.e. the scheduler latency.

What we can see in the row with our timestamp is that the scheduler latency is high (0.277 msec), but what’s even worse is that the task had been waiting 416 msec just to be woken up. Such a situation is a little tricky, as it resembles the chicken-and-egg problem, since either can cause the other: the scheduler’s slow decisions could stem from the same root cause that made the task wait so long, and, conversely, high scheduler latencies can significantly delay the wake-up signal reaching a task. Without evidence we cannot judge either to be the cause, and when we don’t have enough evidence (and sometimes we don’t :)) ) we should still be able to reason about what is more and what is less likely.

Since we don’t know which caused which, we should attack each of them separately, as if the other issue did not exist at all. Thus, the first question is: “What affects the latency of scheduler decisions?”

Recall that the OS scheduler is just kernel code: to do its job it needs CPU time. If the CPU is saturated (for example, by heavy I/O, or by moderate I/O against a slow device), then the scheduler’s behavior can be affected and, hence, its latency may rise.

This is a three-node Apache Cassandra cluster. For each type of task (handling client requests, applying mutations to tables, etc.) the database maintains a pool of threads, each of which performs its job. We need to see what those threads are doing; that should tell us which actions burn the CPU (or are slow). To see the breakdown by threads we will use psn (Tanel Poder’s process snapper).
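
A possible invocation, restricting sampling to the Cassandra process and requesting extra columns to match the output below (the exact flags are an assumption; check psn --help):

sudo psn -p "$(pgrep -f CassandraDaemon)" -G wchan,kstack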

=== Active Threads =============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================

 samples | avg_threads | comm2             | state                  | wchan                        | kstack                                                                                                                                                                                                                                                                                                                                                                                                                     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      32 |        4.00 | (Native-Transpor) | Running (ON CPU)       | 0                            | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
      20 |        2.50 | (MutationStage-1) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
      19 |        2.38 | (Native-Transpor) | Running (ON CPU)       | futex_wait_queue_me          | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
      18 |        2.25 | (Native-Transpor) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
      16 |        2.00 | (java)            | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       8 |        1.00 | (MutationStage-1) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       7 |        0.88 | (Native-Transpor) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       6 |        0.75 | (Native-Transpor) | Running (ON CPU)       | 0                            | do_futex()->futex_wait()->hrtimer_cancel()->hrtimer_try_to_cancel()                                                                                                                                                                                                                                                                                                                                                        
       5 |        0.62 | (MutationStage-6) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       5 |        0.62 | (epollEventLoopG) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       4 |        0.50 | (MutationStage-9) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       4 |        0.50 | (Native-Transpor) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       4 |        0.50 | (java)            | Running (ON CPU)       | 0                            | do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                                                            
       3 |        0.38 | (MutationStage-1) | Running (ON CPU)       | 0                            | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
       3 |        0.38 | (Native-Transpor) | Disk (Uninterruptible) | __lock_page                  | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       2 |        0.25 | (MutationStage-1) | Disk (Uninterruptible) | __lock_page                  | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       2 |        0.25 | (MutationStage-2) | Running (ON CPU)       | futex_wait_queue_me          | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
       2 |        0.25 | (MutationStage-7) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       2 |        0.25 | (MutationStage-8) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       2 |        0.25 | (Native-Transpor) | Running (ON CPU)       | 0                            | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       2 |        0.25 | (Native-Transpor) | Running (ON CPU)       | 0                            | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       1 |        0.12 | (MessagingServic) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
  ................................................................................................................
  ................................................................................................................
  ................................................................................................................

The comm2 column shows a thread from a thread pool, and the state column shows what that thread was doing at the time of sampling. A number of threads are “ON CPU”, but what also catches the eye is that quite a portion of them are waiting for disk I/O to complete (the Disk (Uninterruptible) state). Here I’m showing just one sample and not going into the details of the wchan and kstack columns. Sampling the threads several times shows the same behavior. For a database like Cassandra (if configured properly) we should not observe this many threads spending their time on I/O.

This leads us to suspect that the storage subsystem is not keeping up. The iostat output comes to the rescue (for the following two outputs I only have screenshots saved aside, hence no raw text here):

It takes 53 msec to write 4.77 MB of data! Plus, avgqu-sz is low, so there is no queueing up: the latency is in the device itself. The next contact is the storage team.

Let’s also look at bcc-tools output, just for diversity. The last column is the latency in milliseconds, and the third column from the end is the amount of data (in bytes) written: 512 KB gets written in 0.4 sec!
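
The column layout described matches the biosnoop tool, so the invocation was presumably as simple as this (standard bcc install path assumed):

sudo /usr/share/bcc/tools/biosnoop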

The whole investigation could have been much shorter had we simply run iostat and analyzed the I/O latency. But that would have been just a matter of luck. Even if the path is long, if you take the right direction from the very beginning it will definitely take you to the cause.

This is it for now! Hope you enjoyed it!

Six Ways to Force “Sequential Scan” in PostgreSQL

Here are six ways to force a query to use the sequential scan access method in PostgreSQL. The objective is to eliminate any index access in the query. The test table looks as follows.

postgres=# create table tablescandemo as select generate_series(1,100000) id;
SELECT 100000
postgres=# create index inddemo on tablescandemo(id);
CREATE INDEX

Let’s get started.

Option I. Using enable_indexscan, enable_indexonlyscan and enable_bitmapscan parameters.
Setting enable_indexscan to off disables the plain index scan access path. Let’s try it out.

postgres=# set enable_indexscan = off;
SET
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tablescandemo  (cost=12.17..482.50 rows=500 width=4) (actual time=0.039..0.041 rows=1 loops=1)
   Recheck Cond: (id = 100)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on inddemo  (cost=0.00..12.04 rows=500 width=0) (actual time=0.026..0.027 rows=1 loops=1)
         Index Cond: (id = 100)
 Planning Time: 0.568 ms
 Execution Time: 0.091 ms
(7 rows)

The planner still accesses the index, but this time it behaves differently. The index is walked the same way as in the index scan method; however, the database doesn’t build a list of rowids of the matching heap rows but a bitmap of the matching heap pages (and, when memory allows, of the individual tuples within them). This is how the bitmap index scan works: the heap pages marked in the bitmap are then visited in physical order, and the bitmap decides whether a particular page must be probed or not. The enable_bitmapscan parameter controls the bitmap index scan access method:

postgres=# set enable_bitmapscan = off;
SET
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using inddemo on tablescandemo  (cost=0.29..4.31 rows=1 width=4) (actual time=0.070..0.071 rows=1 loops=1)
   Index Cond: (id = 100)
   Heap Fetches: 0
 Planning Time: 0.086 ms
 Execution Time: 0.094 ms
(5 rows)

Now, index only scan kicks in, since the table has a single column which is indexed. Let’s turn it off as well:

postgres=# set enable_indexonlyscan = off;
SET
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=1 width=4) (actual time=0.021..9.114 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.065 ms
 Execution Time: 9.137 ms
(5 rows)

Basically, one needs to make sure that not just one but all index access methods (index scan, index only scan and bitmap index scan) are off.

Option II. Using random_page_cost parameter.
This parameter tells the optimizer the cost of fetching a randomly accessed disk page, which concerns both index and table pages. The higher the value, the more an index access path costs to the optimizer, especially if the index is not very selective. Let’s see the experiment results (the index scan, index only scan and bitmap index scan control parameters are all back on now):

postgres=# set random_page_cost = 100000;
SET
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=1 width=4) (actual time=0.034..10.175 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.071 ms
 Execution Time: 10.200 ms
(5 rows)

The optimizer estimates that retrieving the tuple(s) via the index will cost more than a sequential scan and hence prefers the latter.

Option III. Altering table statistics.
At the end of the day, the optimizer uses database statistics, which one can actually fake. Selectivity of the predicate is the primary factor in deciding which access path to take. It is based on the cardinality, which is itself calculated as the number of tuples that qualify the predicate. The latter is (assuming no extended statistics exist) nothing but N/NDV, where N is the number of tuples in the table and NDV is the number of distinct values of the column. The higher the NDV, the more selective the predicate is, and thus the higher the probability that the optimizer leans towards index access rather than sequential scan. Conversely, if NDV is low, the optimizer will think that most of the table tuples qualify, making index access cost much more than a sequential scan. In our demo, forcing NDV to 1 yields an estimate of 100000/1 = 100000 qualifying rows, which is exactly the rows=100000 in the plan below. Let’s check it out.

postgres=# alter table tablescandemo alter column id set (n_distinct = 1);
ALTER TABLE
postgres=# analyze tablescandemo;
ANALYZE
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=100000 width=4) (actual time=0.015..6.366 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.098 ms
 Execution Time: 6.379 ms
(5 rows)
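
One can verify that the faked statistic is in place, for example:

psql -c "SELECT n_distinct FROM pg_stats WHERE tablename = 'tablescandemo' AND attname = 'id';"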

Option IV. ‘Hacking’ table statistics.
Even though the previous option is a valid one, it has two drawbacks: 1) the ALTER TABLE command takes a lock on the table and 2) ANALYZE may take a considerable amount of time, depending on the size of the table. If that is a concern, you can update statistics directly in the pg_statistic table (note that the predicate below touches the rows of every column of the table). Surely, this is not an accurate way, but it still works, and in extreme cases it may come in handy:

postgres=# update pg_statistic set stadistinct = 1 where starelid = (select oid from pg_class where relname = 'tablescandemo');
UPDATE 1
postgres=# explain analyze select * from tablescandemo where id = 6;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Seq Scan on tablescandemo(cost=0.00..1693.00 rows=100000 width=4) (actual time=0.008..6.612 rows=1 loops=1)
   Filter: (id = 6)
   Rows Removed by Filter: 99999
 Planning Time: 0.115 ms
 Execution Time: 6.625 ms
(5 rows)

postgres=#

Option V. ‘Hacking’ table statistics (Part II).
Another statistic you can fake is the table size the planner sees: the number of tuples (reltuples) and pages (relpages) recorded for the table. The smaller they are, the higher the probability that a sequential scan kicks in. This information is kept in the pg_class table and can be manipulated just like any other column value:

postgres=# update pg_class set reltuples=1 where relname = 'tablescandemo';
UPDATE 1
postgres=# explain analyze select * from tablescandemo where id = 6;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Seq Scan on tablescandemo  (cost=0.00..443.01 rows=1 width=4) (actual time=0.010..6.496 rows=1 loops=1)
   Filter: (id = 6)
   Rows Removed by Filter: 99999
 Planning Time: 0.138 ms
 Execution Time: 6.511 ms
(5 rows)
postgres=#
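
relpages can be faked the same way, which should lower the estimated cost of the sequential scan accordingly; a sketch:

postgres=# update pg_class set relpages = 1 where relname = 'tablescandemo';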

Option VI. Using hints.
PostgreSQL does not support hints out of the box. However, you can use the pg_hint_plan extension to hint queries. It provides hints for scan methods, join methods, join order, parallelism, row number correction and a few others.
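
A minimal sketch, assuming the pg_hint_plan extension is installed (the hint below forces a sequential scan on our demo table):

LOAD 'pg_hint_plan';
/*+ SeqScan(tablescandemo) */
explain select * from tablescandemo where id = 100;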

Ohh…yes…and option VII: drop all indexes :)).

Parallel Query and %logons% Statistics

Here is a quick and interesting observation regarding the %logons% statistics. The database version is Oracle 19c and, apparently, the behavior applies to earlier versions as well. The question that I am trying to answer here is this: “How does a parallel query execution relate to the %logon% statistics?”

Here is a test case.

Let’s check the statistics for the instance:

SQL> select name, value from v$sysstat where name like '%logon%';

NAME                                  VALUE
-------------------------------- ----------
logons cumulative                     68368
logons current                          108
user logons cumulative                 6762

SQL> 

Now, do nothing but log on to the database:

Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0

SQL>

Let’s check the statistics again:

SQL> select name, value from v$sysstat where name like '%logon%';

NAME                                  VALUE
-------------------------------- ----------
logons cumulative                     68369
logons current                          109
user logons cumulative                 6763

SQL> 

As a result, each of the statistics above gets incremented by one. So far, so good.

Now, from within that brand-new session, let’s execute a parallel query:

SQL> select /*+ parallel(8) */ count(*) from tt1;

  COUNT(*)
----------
     73252

SQL> 

Display the statistics to compare:

SQL> select name, value from v$sysstat where name like '%logon%';

NAME                                  VALUE
-------------------------------- ----------
logons cumulative                     68377
logons current                          109
user logons cumulative                 6763

SQL> 

As a result, the only statistic that changes is “logons cumulative“. It has grown by exactly the number of allocated parallel slaves: 8. Repeated executions of the same query with different degrees of parallelism increment only this statistic, each time by the degree used.

The “logons current” statistic doesn’t change. It counts the coordinator only, not the slave sessions. The coordinator SID is actually the SID of the session that executes the query. Another observation is that the “user logons cumulative” statistic does not include slave processes either.
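
If needed, the slave-to-coordinator mapping can be double-checked while the parallel query is running via V$PX_SESSION, where QCSID holds the coordinator’s SID; a sketch:

SQL> select qcsid, sid, server_group, server_set from v$px_session;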

This may come in very handy when troubleshooting a logon storm caused by parallel execution.

Implicit Population of Materialized Views in In-Memory Column Store

The quickest and simplest way of checking what has been populated in the In-Memory Column Store (IMCS) is to query the V$IM_SEGMENTS view. Whenever an object is loaded into IMCS, the database adds a record about it to this view. In general, I guess the appearance of IMCS as such is Oracle’s attempt to transition towards the NewSQL stack.

According to the Oracle documentation, IMCS can be enabled for a tablespace, a table (including external tables and excluding IOTs), a table partition and a materialized view (MV). What is more interesting is that the documentation also mentions when, for example, a table gets populated in IMCS: only when it is fully scanned (and not in parallel). Index access paths and fetches by rowid never load the tuples into IMCS.
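
A side note: this populate-on-first-scan behavior corresponds to the default PRIORITY NONE. With any other priority level, the database populates the segment in the background without waiting for a scan. A sketch (t is a placeholder table name):

SQL> alter table t inmemory priority high;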

The documentation does not say a word about when an MV gets populated in IMCS. It would be reasonable to assume that, since an MV is very much like a heap table, the table IMCS population rules apply to it as well, i.e. for an MV to get populated in IMCS an explicit nonparallel full scan of it is necessary. This is indeed so. Additionally, implicit population of an MV is also possible, albeit only with a little trick: a table access which causes query rewrite (fully accessing the MV) can also trigger it. All this said, this post is not about something advanced but rather about an interesting take-away regarding IMCS.

The test case is on Oracle 12c. First, let’s create a non-empty table and define an MV on it:

SQL> create table testtable1 as select object_id, object_name from dba_objects;

Table created.

SQL> create materialized view testview1 enable query rewrite as select * from testtable1;

Materialized view created.

SQL>

Let’s now enable IMCS for the view and confirm that it really gets populated in there when scanned fully:

SQL> alter materialized view testview1 inmemory;

Materialized view altered.

SQL> select count(*) from testview1;

  COUNT(*)
----------
     73306

SQL> select segment_name, populate_status from v$im_segments;

SEGMENT_NAME                        POPULATE_STATUS
----------------------------------- ---------------
TESTVIEW1                           COMPLETED

Now, let’s flush the MV from IMCS and then make it eligible for population again. Also, let’s query the table so that query rewrite kicks in:

SQL> alter materialized view testview1 no inmemory;

Materialized view altered.

SQL> alter materialized view testview1 inmemory;

SQL> select segment_name, populate_status from v$im_segments;

no rows selected.

SQL>

As the result from v$im_segments shows, the MV is not there…yet. Let’s cause a full scan of the MV again:

SQL> select count(*) from testview1;

  COUNT(*)
----------
     73306

SQL> select segment_name, populate_status from v$im_segments;

no rows selected.

SQL>

Subsequent executions of the same query still don’t make the MV appear in IMCS, although the little trick will get it populated:

SQL> alter table testtable1 inmemory;

Table altered.

SQL> select count(*) from testtable1;

  COUNT(*)
----------
     73305

SQL> select segment_name, populate_status from v$im_segments;

SEGMENT_NAME                        POPULATE_STATUS
----------------------------------- ---------------
TESTVIEW1                           COMPLETED

SQL>

The query plan now is as follows:

Plan hash value: 3466806115
 
-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name      | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |           |     1 |     3   (0)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE                            |           |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR                           |           |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)                     | :TQ10000  |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE                         |           |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR                     |           | 73306 |     3   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       MAT_VIEW REWRITE ACCESS INMEMORY FULL| TESTVIEW1 | 73306 |     3   (0)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------

The next time the query executes, the table data will be retrieved from the MV, which is now in IMCS (the reason the plan goes parallel is described in MOS note 2373556.1; it is a bug with a workaround). So, query rewrite can cause MV population in IMCS, and for that to happen the base table must have the INMEMORY option enabled.

That’s all for this time. More advanced stuff is coming soon…

Parallel Query and V$SQL

The following is a (short and quick) scenario I recently faced when analyzing a parallel query and its response time. This may not apply to earlier versions of Oracle though. The experiments were carried out in 19c.

The ‘weird’ behavior I observed was that the V$SQL.EXECUTIONS column didn’t get updated for the query under analysis when it was executed. The first reaction is, perhaps, “That must be a bug“. Let’s be more systematic, however, and arrive at the root cause gradually.

Here is a test case. First, I simply execute the query three times and check v$sql:

SQL> select sql_id, sql_text, executions, child_number from v$sql where sql_text like '%test1%';

SQL_ID        SQL_TEXT                          EXECUTIONS CHILD_NUMBER
------------- --------------------------------- ---------- ------------
4gy00kap2kf3n select /*+ test1 */ * from table1          3            0
4gy00kap2kf3n select /*+ test1 */ * from table1          0            1

2 rows selected.

SQL> 

The output shows that the database decided to generate another child (CHILD_NUMBER=1) for this statement. However, the primary question here is “Why are there NO executions for the second child?“.

The reason for not sharing the cursor, which can be found in v$sql_shared_cursor, is not of interest here, although it might shed some light on the question. Still, let’s take a different strategy without consulting this view.

Two approaches can be taken to move forward with this ‘issue’: a) try A, try B, try C,… and hope that eventually one of them reveals the reason, or b) ask the right questions. Before blaming the database for bugs, let’s put that aside as the last resort. There must be something about this that is not yet known and that may lead to the clue.

The simplest question one can ask here is something like “Is there any other statistic that applies to child cursors and has probably been missed?”. This is a separate child and, thus, the entry point should be v$sql. Scanning the column names of the view for ‘%EXEC%’ reveals another field: PX_SERVERS_EXECUTIONS. Let’s query this column as well:

SQL> select sql_id, sql_text, executions, px_servers_executions, child_number from v$sql where sql_text like '%test1%';

SQL_ID        SQL_TEXT                          EXECUTIONS PX_SERVERS_EXECUTIONS CHILD_NUMBER
------------- --------------------------------- ---------- --------------------- ------------
4gy00kap2kf3n select /*+ test1 */ * from table1          3                     0            0
4gy00kap2kf3n select /*+ test1 */ * from table1          0                    12            1

2 rows selected.

SQL> 

PX_SERVERS_EXECUTIONS holds the total number of executions performed by the slaves of a parallel query. This explains why the executions of the second cursor are not reflected in the V$SQL.EXECUTIONS column: for parallel slaves, Oracle tracks them in a different column, V$SQL.PX_SERVERS_EXECUTIONS, and in this case the cursor has been executed 12 times by the slaves so far. Since the query does not have any parallelism forced by a hint, table-level parallelism must be set, which is confirmed by dba_tables.degree being equal to 4. The cursor with CHILD_NUMBER=0 is actually the coordinator. The query executes in parallel:

SQL> select * from dbms_xplan.display_cursor('4gy00kap2kf3n', 0, 'ALLSTATS ALL');

SQL_ID	4gy00kap2kf3n, child number 0
-------------------------------------
select /*+ test1 */ * from table1

Plan hash value: 1574891232
 
--------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |          | 73216 |  9438K|   110   (0)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)| :TQ10000 | 73216 |  9438K|   110   (0)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX BLOCK ITERATOR |          | 73216 |  9438K|   110   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   4 |     TABLE ACCESS FULL| TABLE1   | 73216 |  9438K|   110   (0)| 00:00:01 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------
 
Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (E - Syntax error (1))
---------------------------------------------------------------------------
   1 -  SEL$1
   E -  test1
Note
-----
   - Degree of Parallelism is 4 because of table property

28 rows selected.

SQL>

Moral: a step-by-step approach is the key, regardless of the issue and the DBMS.

A potential method for determining the join method without explain plan

I will not pretend that the idea presented in this post addresses a wide range of troubleshooting scenarios, but in some cases it may come in handy. For example, imagine a situation where you cannot log in to the database because a long-running query makes it unresponsive, while the OS is running fine and can be worked with.

The simple and not-yet-so-mature approach presented here should help one judge the join method a query is likely performing. At least, it may give some (though incomplete) ground to reason about it.

All other conditions being equal, the Nested Loop join is the method that executes the highest number of iterations to fetch the requested tuples. The other two methods, Hash Join and Sort-Merge Join, go through fewer iterations and, perhaps, fewer loops.

The scenario with the assumptions above is not a rare one. However, the question is “Once you start suspecting that it is most probably the Nested Loop join that’s causing the hang, how can that help?”. At the very least, one can tell the software developers to fix the query by changing the join method, which should solve the issue. Since (as assumed) the database is unresponsive, one cannot peek at the explain plan and tell for sure. What can help, though, is “expressing the join method via loops”.

To do this, let’s step aside a little.

When the compiler converts a loop in high-level code (while/for) to assembly, it maps it to a branch instruction. The branch instruction updates the contents of the Program Counter (PC; also known as the Extended Instruction Pointer, EIP, in x86) register with the Branch Target Address (BTA). On the next cycle the CPU fetches the instruction at the address now held in the PC (that is, the instruction pointed to by the branch) and executes it. To illustrate, here is an assembly example for ARM processors:

.global main

main:
        mov     r0, #0     /* setting up initial variable a */
loop:   cmp     r0, #4     /* checking if a==4 */
        beq     end        /* proceeding to the end if a==4 */
        add     r0, r0, #1 /* increasing a by 1 if the jump to the end did not occur */
        b loop             /* repeating the loop */
end:
        bx lr              /* THE END */

Here, b is a branch instruction with the label loop. The label loop points to the cmp (compare) instruction. When the CPU reaches the b loop instruction (line 8), the PC’s contents are updated with the value of the loop label, which is nothing but the address of the cmp instruction. On the next cycle the PC is read, the address of cmp (line 5) is obtained, and the instruction at that address is fetched and executed. In this way, the branch instruction “makes a jump”. This is how a loop iterates.

Thus, a high number of branch instructions can be an indicator of excessive iteration, in particular of a Nested Loop join.

The next step is finding an OS-level statistic that shows the number of branch instructions executed in real time. This is where the perf tool comes into play.

The perf tool is a performance profiler that can display various hardware and software statistics and counters. It can also be attached to a running process. For example, as follows:

# perf stat -p 4539
       1920.159821 task-clock                #    0.991 CPUs utilized          
                13 context-switches          #    0.007 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
               258 page-faults               #    0.134 K/sec                  
     5,649,595,479 cycles                    #    2.942 GHz                     [83.43%]
     1,808,339,931 stalled-cycles-frontend   #   32.01% frontend cycles idle    [83.54%]
     1,171,884,577 stalled-cycles-backend    #   20.74% backend  cycles idle    [66.77%]
     8,625,207,199 instructions              #    1.53  insns per cycle        
                                             #    0.21  stalled cycles per insn [83.51%]
     1,488,797,176 branches                  #  775.351 M/sec                   [82.58%]
        53,395,139 branch-misses             #    3.59% of all branches         [83.78%]

       1.936842598 seconds time elapsed

The command accumulates statistics until cancelled (CTRL+C). So, the perf output shows that by that time 1,488,797,176 branch instructions had been executed, of which 53,395,139 the branch predictor failed to predict. The former is actually the main metric of interest. If large deltas are observed in the branches metric, the code is executing a high number of iterations. This, in turn, can direct us to suspect that (probably) a Nested Loop join is running intensively. Sampling this metric over time and seeing that the number is unreasonably high can be a tip that the code needs to be checked for a loop iterating over and over again.
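
To sample the deltas directly, perf stat can print the counters at a fixed interval instead of once at exit. A sketch, reusing the PID from above (-I takes milliseconds, so 1000 prints every second):

# perf stat -e branches,branch-misses -I 1000 -p 4539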

Side note: the Nested Loop join is not the only code path that can cause this. It can also be, for example, latch contention, since retry logic is an essential part of pretty much any latch implementation. Another example is an index scan, where the code iteratively walks through the index to find the key; this, by the way, can itself be part of a Nested Loop join. The approach presented should be applicable to any RDBMS.

In the next post, I will reproduce a highly iterative Nested Loop join and will analyze the metric mentioned. This will give a better insight into improving the idea briefly presented.

The Hidden “INDEX UNIQUE SCAN”

The Oracle optimizer can sometimes trick you, and this post is about a finding that clearly demonstrates that. Putting it shortly, while the execution plan shows that the optimizer is doing an “INDEX RANGE SCAN”, in actuality it is going with an “INDEX UNIQUE SCAN”. Here is a test case in Oracle 19c.

First, let’s create a table, populate it with data and add an index on its column.

SQL> create table table1(id integer);

Table created.

SQL> insert into table1 select object_id from dba_objects;

7433 rows created.

SQL> create index IND on table1(id);

Index created.

Let’s confirm the type of the index just created, which will be needed a little later:

SQL> select UNIQUENESS from dba_indexes where index_name = 'IND';

UNIQUENES
---------
NONUNIQUE

Since the index is nonunique, the optimizer will (or should !?) never go with “INDEX UNIQUE SCAN”.

Let’s now add a primary key.

SQL> alter table table1 add primary key(id);

Table altered.

Since an index on the column ID already exists, Oracle doesn’t attempt to create a new one; it links the constraint to the existing index. We can confirm this by querying the DBA_CONSTRAINTS view. Notably, the database does not update the index type in the data dictionary:

SQL> select constraint_name, index_name from dba_constraints where table_name = 'TABLE1';

CONSTRAINT_NAME                INDEX_NAME
------------------------------ ------------------------------
SYS_C008559                    IND

SQL> select UNIQUENESS from dba_indexes where index_name = 'IND';

UNIQUENES
---------
NONUNIQUE

What we have so far is a table with a primary key constraint enforced via an index which Oracle reports as NONUNIQUE (which, effectively, it is not).

It is interesting to see which index access path the optimizer will take for a query with an equality predicate.

select * from table1 where id = 2;
============
Plan Table
============
------------------------------------+-----------------------------------+
| Id  | Operation         | Name    | Rows  | Bytes | Cost  | Time      |
------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT  |         |       |       |     1 |           |
| 1   |  INDEX RANGE SCAN | IND     |     1 |    13 |     0 |           |
------------------------------------+-----------------------------------+

The optimizer reports an “INDEX RANGE SCAN”. According to the type of the index in the data dictionary, this is expected. However, we definitely know that the index is unique, since this is what the primary key constraint is enforced by, and a unique index should not be causing range scans.

Let’s see what session statistics related to indexes the database reports.

select s.name, value from v$mystat m, v$statname s where s.STATISTIC# = m.STATISTIC#
and s.name like '%index%' 
and value != 0;

NAME                        VALUE
--------------------------- ----------
index range scans           1

So, the statistics also state that the optimizer chose the “INDEX RANGE SCAN” access method (this statistic is missing in the Database Reference 19c).

Finally, let’s take a look at the optimizer trace (Access path analysis section only) to see what really happens behind the scenes.
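
For reference, a common way to obtain such a trace is the 10053 event; a sketch (the trace file location and name depend on the configuration):

SQL> alter session set events '10053 trace name context forever, level 1';
SQL> explain plan for select * from table1 where id = 2;
SQL> alter session set events '10053 trace name context off';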

 Access Path: TableScan
    Cost:  2.000631  Resp: 2.000631  Degree: 0
      Cost_io: 2.000000  Cost_cpu: 23521
      Resp_io: 2.000000  Resp_cpu: 23521
  Access Path: index (index (FFS))
    Index: IND
    resc_io: 2.000000  resc_cpu: 0
    ix_sel: 0.000000  ix_sel_with_filters: 1.000000
  Access Path: index (FFS)
    Cost:  2.000000  Resp: 2.000000  Degree: 1
      Cost_io: 2.000000  Cost_cpu: 0
      Resp_io: 2.000000  Resp_cpu: 0
 ****** Costing Index IND
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Access Path: index (UniqueScan)
    Index: IND
    resc_io: 0.000000  resc_cpu: 1050
    ix_sel: 0.004000  ix_sel_with_filters: 0.004000
    Cost: 0.000028  Resp: 0.000028  Degree: 1
 ****** Costing Index IND
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Access Path: index (AllEqUnique)
    Index: IND
    resc_io: 0.000000  resc_cpu: 1050
    ix_sel: 0.004000  ix_sel_with_filters: 0.004000
    Cost: 0.000028  Resp: 0.000028  Degree: 1
 One row Card: 1.000000
  Best:: AccessPath: IndexUnique
  Index: IND
         Cost: 0.000028  Degree: 1  Resp: 0.000028  Card: 1.000000  Bytes: 0.000000

Voila! In actuality, no “INDEX RANGE SCAN” happens, but an “INDEX UNIQUE SCAN” does. This is confirmed by the access path choice the optimizer ends up with: Best:: AccessPath: IndexUnique (line 30). The “INDEX RANGE SCAN” access path is not even considered as an option. Apparently, the session statistics are derived from what the optimizer shows in the execution plan.

In summary, while the optimizer reports the “INDEX RANGE SCAN” access method, it may not actually be doing so, performing an “INDEX UNIQUE SCAN” instead. The execution plan is not the source of truth in such a case.
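
As a side note, such constellations can be spotted in the data dictionary. A minimal sketch that lists primary key constraints backed by nonunique indexes (owner handling simplified):

select c.owner, c.constraint_name, c.index_name, i.uniqueness
from dba_constraints c, dba_indexes i
where i.owner = c.owner
and i.index_name = c.index_name
and c.constraint_type = 'P'
and i.uniqueness = 'NONUNIQUE';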

The “force merge” process marks indexes searchable in Elasticsearch

It is well known that once Elasticsearch (ES) indexes a document, the document does not become searchable right away. For an index to become searchable, a refresh must run, which internally copies the documents to a new segment in the OS cache and opens it for searches. Once the index has been refreshed, all documents within it become searchable.

According to the ES documentation, the refresh process runs every second by default. Alternatively, one can run it manually as well. It can also be disabled entirely by setting the index’s “refresh_interval” setting to -1.
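
For reference, the manual refresh is a single call to the _refresh endpoint (my_index below is a placeholder name):

curl -X POST "localhost:9200/my_index/_refresh?pretty"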

A simple experiment reveals that there is another (undocumented) case which can make an index searchable: the force merge.

This process runs in the background and periodically merges segments into larger ones. It is necessary because of updated and/or deleted documents, since ES does not implement these operations in place. For example, when a document is updated, ES creates a new document with a higher version number, and the old document is kept until the next merge, which, out of the several versions of the same document, keeps only the one with the highest version number.
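
Merge activity, if one wants to keep an eye on it, is exposed through the index stats API, e.g. (my_index is, again, a placeholder):

curl -X GET "localhost:9200/my_index/_stats/merge?pretty"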

Let’s put up an experiment and check out how the merge process affects searchability of the index.

First, let’s add an index:

curl -X PUT "localhost:9200/index_custom?pretty"

Let’s then make sure that the refresh is disabled by setting the value of “refresh_interval” to -1:

curl -X PUT "localhost:9200/index_custom/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval": "-1"
  }
}
'

Now, let’s add a couple of documents to the index (with ids 1 and 2 respectively):

curl -X PUT "localhost:9200/index_custom/_doc/1?timeout=5m&pretty" -H 'Content-Type: application/json' -d'
{     
    "id": "kimchy",
    "name": "kimchy1"
}
'
curl -X PUT "localhost:9200/index_custom/_doc/2?timeout=5m&pretty" -H 'Content-Type: application/json' -d'
{     
    "id": "kimchy",
    "name": "kimchy2"
}
'

Let’s now flush it to disk to ensure that segments have been created:

curl -X POST "localhost:9200/index_custom/_flush?pretty"

Confirm the number of documents in the index (docs.count):

curl 'localhost:9200/_cat/indices?v'
health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index_custom       jqTFVQm0R_u9bL2kIGQrUA   1   1          2            0      4.8kb           225b

Let’s now search for a document

curl -X GET "localhost:9200/index_custom/_search?pretty&pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
        "term" : { "id" : "kimchy" }
    }
}
'

and check the response

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

As expected, no document is returned since the index has not been refreshed yet.

Let’s confirm it by checking the statuses of the segments of the index. Notice that the value of the “searchable” column is false for both of the segments.

curl -X GET "localhost:9200/_cat/segments?v=true&pretty"
index        shard prirep ip        segment generation docs.count docs.deleted  size size.memory committed searchable version compound
index_custom 0     r      localhost _0               0          2            0 4.3kb           0 true      false      9.0.0   true
index_custom 0     p      localhost _0               0          2            0 4.3kb           0 true      false      9.0.0   true

Finally, let’s manually invoke the merge process

curl -X POST "localhost:9200/index_custom/_forcemerge?pretty"

and rerun the query. As a result, the documents are returned, which means the index has become searchable:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "index_custom",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "id" : "kimchy",
          "name" : "kimchy1"
        }
      },
      {
        "_index" : "index_custom",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "id" : "kimchy",
          "name" : "kimchy2"
        }
      }
    ]
  }
}

This can also be confirmed by double-checking the value of the “searchable” column in the list of segments.

Conclusion: the refresh process is not the only one that makes an index searchable; a force merge can make it searchable as well. This is not documented.