Diving deeper into the LWLock in PostgreSQL.

Lightweight locks (LWLocks) are a type of interprocess lock used by PostgreSQL to coordinate simultaneous access by multiple processes to shared memory structures. Various conditions can turn these locks into a bottleneck. I will start with a short yet concise demo, followed by an exploration of the nature of LWLocks in PostgreSQL through its source code and the underlying operating-system machinery.

As a first step, I will generate some load on the database using pgbench. While not extremely flexible, this tool is adequate for testing the basics. The invocation is as follows.

[postgres@centos8 ~]$ pgbench -c 10 -j 10 -t 100000 -S tpcc
pgbench (15.5)
starting vacuum...end.

It spawns 10 worker threads (-j) to drive 10 concurrent client sessions (-c), each of which runs 100,000 (-t) read-only transactions (-S) as fast as possible (full speed is the default; the rate can be capped with the -R option). Visually, the resulting wait-event profile appears as follows:

This is certainly not the behavior one would expect from a well-performing database with reasonable response times. The dark-red LWLock wait prevails in the chart, and we certainly need to bring it down. Typically, the LWLock is accompanied by I/O, as is the case here. However, this is not always the scenario, as a few other causes exist. Let’s assume we have no information about the database and try to figure out what is going on.

The pgsentinel extension, when loaded, samples the active sessions (every second by default) and makes the history available through the pg_active_session_history view. This allows querying what the sessions were doing at a particular point in time or within a range, making it very useful when diagnosing performance issues:

[postgres@centos8 home]$ ./ash wait_event_type,wait_event "backend_type='client backend'" '2023-11-24 20:29:00' '2023-11-24 20:30:00'
 wait_event_type |    wait_event    | samples | sessions 
 CPU             | CPU              |     213 |       11
 LWLock          | BufferMapping    |     196 |       10
 IO              | DataFileRead     |      61 |       10
 IPC             | BufferIO         |      50 |       10
 IPC             | ExecuteGather    |       7 |        7
 IPC             | BgWorkerShutdown |       4 |        4
(6 rows)

This is a bash script which connects to the database and queries the pg_active_session_history view. The first argument (wait_event_type,wait_event) lists the grouping columns, followed by a filter, and then by the time range we are interested in.
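The script itself is not shown in the post, but its behavior can be sketched. Below is a minimal, hypothetical version that only assembles and prints the grouping query over pg_active_session_history; the column names ash_time and pid come from pgsentinel's view, while the defaults and the print-instead-of-execute design are my assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of an "ash"-style helper: builds a grouped query
# over pgsentinel's pg_active_session_history view and prints it.
GROUP_COLS="${1:-wait_event_type,wait_event}"
FILTER="${2:-1=1}"
FROM_TS="${3:-2023-11-24 20:29:00}"
TO_TS="${4:-2023-11-24 20:30:00}"

SQL="SELECT $GROUP_COLS,
       count(*)            AS samples,
       count(DISTINCT pid) AS sessions
FROM   pg_active_session_history
WHERE  ($FILTER)
AND    ash_time BETWEEN '$FROM_TS' AND '$TO_TS'
GROUP  BY $GROUP_COLS
ORDER  BY samples DESC;"

# Print the generated query; pipe it into psql to actually run it,
# e.g.: ./ash ... | psql -d tpcc
printf '%s\n' "$SQL"
```

Piping the generated statement into psql against the target database would produce output shaped like the tables above.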

From the output (as well as the chart), it becomes evident that the LWLock is the culprit, with approximately 10 sessions waiting on BufferMapping. This wait event occurs when a process waits to associate a data page with a buffer in shared buffers: once the necessary disk page has been identified, the process must find a free buffer in the buffer pool to load the page into. This is precisely what the processes are currently awaiting. Before we rerun the script with additional dimensions that may be helpful, let’s draw a few reasonable conclusions here.

If the process cannot find a free buffer in the buffer pool, then the most likely suspect is that the shared buffers are full. The (misleading) path to take then would be something like: “Well, if it is full, then let’s add some space.” Although occasionally feasible, in practice the correct approach is to ask: “What is polluting it?” Well… definitely the queries :)). Let’s then find out what these queries look like (only the top 20 rows are returned by default).

[postgres@centos8 home]$ ./ash wait_event_type,wait_event,query "backend_type='client backend'" '2023-11-24 20:29:00' '2023-11-24 20:30:00'
 wait_event_type |  wait_event   |                           query                           | samples | sessions 
 LWLock          | BufferMapping | SELECT abalance FROM pgbench_accounts WHERE aid = 1679538 |       5 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 6508293 |       4 |        1
 LWLock          | BufferMapping | SELECT abalance FROM pgbench_accounts WHERE aid = 649929  |       4 |        1
 LWLock          | BufferMapping | SELECT abalance FROM pgbench_accounts WHERE aid = 6487070 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 5088721 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9953261 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1659220 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9321817 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9104089 |       3 |        1
 IO              | DataFileRead  | SELECT abalance FROM pgbench_accounts WHERE aid = 8129419 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1997345 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1172966 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 513858  |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9534579 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1393497 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 3166138 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 3735733 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 2971116 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 4772208 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 2960432 |       3 |        1
(20 rows)

All sessions are executing the same query, just with different literals. The query looks very innocent: a small point query which apparently returns just a single row (the column “aid” looks like some sort of ID). Remember, we have no information about the database, nothing at all. That is, we cannot let names, IDs, etc. put us on the wrong path.

We now know what is polluting the buffer pool. Next, we have to see why this query pollutes it. The explain plan will show us the details of the query execution:

tpcc=# explain (analyze, buffers) SELECT abalance FROM pgbench_accounts WHERE aid = 1679538;
                                                             QUERY PLAN                                                             
 Gather  (cost=1000.00..217018.43 rows=1 width=4) (actual time=2434.719..2434.813 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=12678 read=151257
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..216018.33 rows=1 width=4) (actual time=2166.930..2423.078 rows=0 loops=3)
         Filter: (aid = 1679538)
         Rows Removed by Filter: 3333333
         Buffers: shared hit=12678 read=151257
   Buffers: shared hit=41 read=1
 Planning Time: 6.413 ms
 Execution Time: 2434.930 ms
(12 rows)

The query reads 151,257 pages from disk to return a single row (rows=1). Another 12,678 pages are luckily found in the shared buffers, but still, ~151k page reads is way too high to spit out a single row. The other interesting aspect here is the line “Rows Removed by Filter: 3333333”. In actuality, the database removed far more rows than that: 9,999,999, because the reported value is the per-process average across the three parallel processes (loops=3). The explanation is in another LinkedIn post of mine :)).

So, the query reads lots of pages, of which only one holds the necessary row. This is what actually pollutes the buffer pool. Adding an index on the filtering column will definitely make things much better: the query will then read only a few pages, a couple from the index and one from the table (unless the table/index is bloated). This is how things looked after the column had been indexed:

[postgres@centos8 home]$ ./ash wait_event_type,wait_event "1=1" '2023-11-24 22:00:00' '2023-11-24 22:01:00'
 wait_event_type |  wait_event  | samples | sessions 
 Client          | ClientRead   |      36 |       10
 CPU             | CPU          |      16 |        8
 IO              | DataFileRead |      16 |        8
(3 rows)

The LWLock wait is gone.

Now that it is clear how to deal with LWLock-related waits (at least in a typical case), it is also important to understand the implementation details. This time, we’ll delve into how the lock is acquired (the release is much simpler) and what actually counts as wait duration while the lock is being waited on. The current wait interface in PostgreSQL does not expose wait durations. A good first approach is to examine the functions that execute when a process attempts to acquire the lock.

Let’s sample the stack trace with pstack and see what is going on (I’ve sampled it many times in a loop for a single process but will show only one sample here):

[root@centos8 ~]# pstack 4391
#0  0x00007fd95fd9cda6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fd95fd9ce98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00000000007815a2 in PGSemaphoreLock (sema=0x7fd819711eb8) at pg_sema.c:327
#3  0x00000000007fca6c in LWLockAcquire (lock=lock@entry=0x7fd819715580, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1315
#4  0x00000000007dbf8f in BufferAlloc (foundPtr=0x7fff9354e59b, strategy=0x1c7da98, blockNum=134785, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=0x1c3d8f8) at bufmgr.c:1320
#5  ReadBuffer_common (smgr=0x1c3d8f8, relpersistence=<optimized out>, forkNum=MAIN_FORKNUM, blockNum=134785, mode=RBM_NORMAL, strategy=0x1c7da98, hit=0x7fff9354e647) at bufmgr.c:873
#6  0x00000000007dc6f0 in ReadBufferExtended (reln=0x7fd96169d1f8, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=134785, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>) at ../../../../src/include/utils/rel.h:560
#7  0x0000000000549c66 in heapgetpage (sscan=sscan@entry=0x1c7cd18, page=page@entry=134785) at heapam.c:402
#8  0x000000000054a1f6 in heapgettup_pagemode (scan=scan@entry=0x1c7cd18, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1120
#9  0x000000000054b162 in heap_getnextslot (sscan=0x1c7cd18, direction=<optimized out>, slot=0x1c73270) at heapam.c:1352
#10 0x00000000006cff6b in table_scan_getnextslot (slot=0x1c73270, direction=ForwardScanDirection, sscan=<optimized out>) at ../../../src/include/access/tableam.h:1046
#11 SeqNext (node=node@entry=0x1c730d0) at nodeSeqscan.c:80
#12 0x00000000006a9c52 in ExecScanFetch (recheckMtd=0x6cfeb0 <SeqRecheck>, accessMtd=0x6cfee0 <SeqNext>, node=0x1c730d0) at execScan.c:132
#13 ExecScan (node=0x1c730d0, accessMtd=0x6cfee0 <SeqNext>, recheckMtd=0x6cfeb0 <SeqRecheck>) at execScan.c:198
#14 0x00000000006ba8fa in ExecProcNode (node=0x1c730d0) at ../../../src/include/executor/executor.h:262
#15 gather_getnext (gatherstate=0x1c72ee0) at nodeGather.c:295
#16 ExecGather (pstate=0x1c72ee0) at nodeGather.c:227

The BufferAlloc function is responsible for seeking a free buffer. As part of that, it calls LWLockAcquire, which attempts to acquire the lock protecting the buffer mapping table. The remaining three frames (0, 1, and 2) show that the process sleeps on a semaphore (implemented on top of a futex, as the do_futex_wait frame suggests) when it fails to acquire the lock.

What cannot be seen from here is whether the process sleeps or spins, and if it sleeps, for how long. Tracing system calls can help us here: their frequency and arguments reveal the nature of the waiting. Let’s strace it and see:

pread64(6, "\6\0\0\0(\257\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62840832) = 8192
pread64(6, "\6\0\0\0\330\311\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62849024) = 8192
pread64(6, "\6\0\0\0h\343\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62857216) = 8192
pread64(6, "\6\0\0\0 \375\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62865408) = 8192
pread64(6, "\6\0\0\0\360\26\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62873600) = 8192
pread64(6, "\6\0\0\0\3000\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62881792) = 8192
pread64(6, "\6\0\0\0\220J\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62889984) = 8192
pread64(6, "\6\0\0\0`d\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62898176) = 8192
pread64(6, "\6\0\0\0\30~\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62906368) = 8192
futex(0x7fc549a67038, FUTEX_WAKE, 1)    = 1

This is a portion of the trace. The futex syscall shows up only occasionally among the reads. In the full trace there are two variants of it with different arguments. The first, futex(addr, FUTEX_WAIT, ...), puts the calling thread to sleep until awakened, because the fourth parameter, which determines the timeout of the sleep, is NULL. The second, futex(addr, FUTEX_WAKE, 1), shown on the last line above, wakes up a waiting thread; the third argument, 1, indicates that at most one thread must be awakened.

We now know how long a thread sleeps when it fails to acquire the lock: until it is woken up. Additionally, we have determined that there is little or no spinning.

Now let’s peek at the source code. The call stack above shows that LWLockAcquire resides in the lwlock.c file (irrelevant lines have been replaced with dots):

LWLockAcquire(LWLock *lock, LWLockMode mode)
{
	...
	for (;;)
	{
		bool		mustwait;

		/*
		 * Try to grab the lock the first time, we're not in the waitqueue
		 * yet/anymore.
		 */
		mustwait = LWLockAttemptLock(lock, mode);

		if (!mustwait)
		{
			LOG_LWDEBUG("LWLockAcquire", lock, "immediately acquired lock");
			break;				/* got the lock */
		}

		/* add to the queue */
		LWLockQueueSelf(lock, mode);

		/* we're now guaranteed to be woken up if necessary */
		mustwait = LWLockAttemptLock(lock, mode);

		/* ok, grabbed the lock the second time round, need to undo queueing */
		if (!mustwait)
		{
			LOG_LWDEBUG("LWLockAcquire", lock, "acquired, undoing queue");
			LWLockDequeueSelf(lock);
			break;
		}

		/*
		 * Wait until awakened.
		 *
		 * It is possible that we get awakened for a reason other than being
		 * signaled by LWLockRelease.  If so, loop back and wait again.  Once
		 * we've gotten the LWLock, re-increment the sema by the number of
		 * additional signals received.
		 */
		LOG_LWDEBUG("LWLockAcquire", lock, "waiting");
		...
		LWLockReportWaitStart(lock);
		...
		for (;;)
		{
			PGSemaphoreLock(proc->sem);
			if (!proc->lwWaiting)
				break;
			extraWaits++;
		}

		/* Retrying, allow LWLockRelease to release waiters again. */
		pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_RELEASE_OK);
		...
		LWLockReportWaitEnd();
		...
		LOG_LWDEBUG("LWLockAcquire", lock, "awakened");

		/* Now loop back and try to acquire lock again. */
		result = false;
	}
	...
	return result;
}

This is how it works. First, the process attempts to acquire the lock via LWLockAttemptLock and, if unsuccessful, queues itself with LWLockQueueSelf. A second attempt follows, and if it succeeds, the process undoes the queueing (LWLockDequeueSelf) and proceeds. If it fails, the process goes to sleep on a semaphore (PGSemaphoreLock). Right before the sleep, it registers the wait by calling LWLockReportWaitStart, and it ends the wait by calling LWLockReportWaitEnd. The sleep ends when the process is awakened by LWLockRelease, after which it loops back and retries the acquisition.

Therefore, LWLock acquisition is more about sleeping than spinning. The duration of the wait is the time the process sleeps on its semaphore (spurious wakeups are simply looped over); the few spins beforehand are negligible in comparison.

More drill-down in a future post…

Reasoning about OS Scheduler for Performance Degradation Analysis

It all started with a performance issue in one of our clusters. One of the components on the critical path turned out to be the OS scheduler, which I analyzed to find out whether it was the offender itself or just a symptom, with the global slowness caused by other factors in the stack. The whole story is very long and would take several pages to describe. That is why I will present only a short excerpt, one part of the whole diagnostic journey. So don’t be surprised that the story (sort of) has neither a beginning nor an end.

I decided to peek at the OS scheduler because vmstat reported high run-queue lengths, which led me to believe the scheduler would very likely bring me to the ultimate offender. Scheduler statistics can be traced with the perf command, which has many options. Since the issue had been ongoing for several minutes, a few tens of seconds of tracing would suffice. Moreover, as I didn’t know which scheduler event to focus on, I traced all of them:

[fgenceliyev@machine1 ~]$ sudo perf sched record -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 50.183 MB perf.data (132,856 samples) ]

The next step was to check whether my suspicion was true and I was going in the right direction (the output is intentionally shortened):

[fgenceliyev@machine1 ~]$ sudo perf sched latency --sort max

  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
  MutationStage-1:(2)   |    240.607 ms |    13933 | avg:    0.083 ms | max:  416.618 ms | max at: 275770.668504 s
  C2 CompilerThre:(6)   |   9980.936 ms |     2010 | avg:    0.934 ms | max:  322.393 ms | max at: 275772.038103 s
  MutationStage-8:17873 |     50.710 ms |     1743 | avg:    0.248 ms | max:  310.998 ms | max at: 275771.587630 s
  MutationStage-5:17870 |     62.694 ms |     2213 | avg:    0.347 ms | max:  310.980 ms | max at: 275771.587612 s
  Thread-17:19000       |     49.637 ms |      783 | avg:    0.494 ms | max:  310.795 ms | max at: 275771.587998 s
  vmtoolsd:1972         |     17.737 ms |       73 | avg:    4.401 ms | max:  287.086 ms | max at: 275769.602360 s

No comments here: huge delays! There are two pieces of good news though: a) the output has a timestamp that shows when the maximum delay happened for a particular process (or thread), and b) it doesn’t matter which process from the output we chase further, since (whatever it is) all of them suffer from the same problem.

At this point, we know the delay is there, but we don’t know where it occurs. In other words, we need to split the delay into its parts and see where exactly most of the time is spent (we can take an arbitrary process to measure).

Let’s take the very first timestamp (275770.668504) for analysis and see how the time is distributed. This can be achieved with the timehist option of perf sched, which gives a per-event breakdown.

[fgenceliyev@machine1 ~]$ sudo perf sched timehist -f
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  275767.383658 [0001]  cluster1-nio-wo[18972/18944]        0.000      0.000      0.000 
  275767.383663 [0002]  <idle>                              0.000      0.000      0.000 
  275767.383664 [0000]  <idle>                              0.000      0.000      0.000 
  275767.383670 [0002]  MutationStage-5[17870/17693]        0.000      0.000      0.006 
  275767.383670 [0000]  MutationStage-1[17803/17693]        0.000      0.000      0.006 
  275767.383696 [0003]  cluster1-nio-wo[18970/18944]        0.000      0.000      0.000 
  275767.383697 [0001]  Thread-6[18989/18944]               0.000      0.000      0.039 
  275767.383699 [0005]  <idle>                              0.000      0.000      0.000
  275770.668504 [0004]  cluster1-nio-wo[18971/18944]      416.371      0.277      0.254

We are interested in two columns: wait time and sch delay. The first is the time a task was waiting to be woken up; the second is the time from the moment the task is woken up (becomes Runnable) to the point it is actually scheduled in. The latter is the scheduling latency, i.e., the time it takes the scheduler to switch a Runnable task to Running.

What we can see in the last row above is that the scheduling latency is very high, 0.277 ms, but what’s even worse is that the task had been waiting 416 ms just to be woken up. Such a situation is a little tricky, as it resembles the chicken-and-egg problem, since either can cause the other: the scheduler’s slow decisions could stem from the same causes that made the task wait so long, and conversely, high scheduler latencies can significantly delay the wake-up signal reaching a task. Without evidence we cannot judge either to be the cause; when we don’t have enough evidence (and sometimes we don’t :)) ), we should still be able to reason about what is more and what is less likely.

Since we don’t know which caused which, we should attack each of them separately, as if the other issue did not exist. Thus, the first question is: what affects the latency of scheduler decisions?

Recall that the OS scheduler is just kernel code and needs CPU time to do its job. If the CPUs are saturated (for example, by heavy I/O, or by moderate I/O against a slow device), the scheduler’s work can be delayed and, hence, its latency may rise.

This is a three-node Apache Cassandra cluster. For each type of task (handling client requests, updating tables, etc.) the database maintains a pool of threads, each performing its job. We need to see what those threads are doing; that should reveal which actions burn the CPU (or which are slow). To see the breakdown by thread, we will use psn (Tanel Poder’s Linux Process Snapper):
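psn itself works by repeatedly sampling /proc; a single simplified sample can be taken with a few lines of shell (Linux-only; my own sketch, far less capable than the real tool). The state field distinguishes R (running), S (sleeping), and D (uninterruptible, typically disk I/O, dominating the output below), while wchan names the kernel function a sleeping thread is waiting in.

```shell
#!/bin/sh
# One-shot, psn-style sample of a process's threads via /proc (Linux only).
PID="${1:-$$}"   # default: sample this shell itself
for t in /proc/"$PID"/task/*; do
    tid=${t##*/}
    comm=$(cat "$t/comm")
    # The comm field inside stat may contain spaces, so strip everything
    # up to the closing parenthesis before taking the state field.
    state=$(sed 's/^.*) //' "$t/stat" | cut -d' ' -f1)
    wchan=$(cat "$t/wchan" 2>/dev/null)
    printf '%-8s %-16s %-2s %s\n' "$tid" "$comm" "$state" "${wchan:-0}"
done
```

The real psn aggregates many such samples over time (and can also attach kernel stacks from /proc/<pid>/stack), which is what produces the averaged view below.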

=== Active Threads =============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================

 samples | avg_threads | comm2             | state                  | wchan                        | kstack                                                                                                                                                                                                                                                                                                                                                                                                                     
      32 |        4.00 | (Native-Transpor) | Running (ON CPU)       | 0                            | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
      20 |        2.50 | (MutationStage-1) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
      19 |        2.38 | (Native-Transpor) | Running (ON CPU)       | futex_wait_queue_me          | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
      18 |        2.25 | (Native-Transpor) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
      16 |        2.00 | (java)            | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       8 |        1.00 | (MutationStage-1) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       7 |        0.88 | (Native-Transpor) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       6 |        0.75 | (Native-Transpor) | Running (ON CPU)       | 0                            | do_futex()->futex_wait()->hrtimer_cancel()->hrtimer_try_to_cancel()                                                                                                                                                                                                                                                                                                                                                        
       5 |        0.62 | (MutationStage-6) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       5 |        0.62 | (epollEventLoopG) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       4 |        0.50 | (MutationStage-9) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       4 |        0.50 | (Native-Transpor) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       4 |        0.50 | (java)            | Running (ON CPU)       | 0                            | do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                                                            
       3 |        0.38 | (MutationStage-1) | Running (ON CPU)       | 0                            | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
       3 |        0.38 | (Native-Transpor) | Disk (Uninterruptible) | __lock_page                  | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       2 |        0.25 | (MutationStage-1) | Disk (Uninterruptible) | __lock_page                  | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       2 |        0.25 | (MutationStage-2) | Running (ON CPU)       | futex_wait_queue_me          | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
       2 |        0.25 | (MutationStage-7) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       2 |        0.25 | (MutationStage-8) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       2 |        0.25 | (Native-Transpor) | Running (ON CPU)       | 0                            | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       2 |        0.25 | (Native-Transpor) | Running (ON CPU)       | 0                            | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       1 |        0.12 | (MessagingServic) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          

The comm2 column shows a thread from the thread pool and the state column shows what that thread was doing at the time of sampling. What is clear is that a number of threads are “ON CPU”, but what also catches the eye is that quite a portion of them are waiting for disk I/O to complete (the Disk (Uninterruptible) state). Here I’m showing just one sample and not going into the details of the values in the wchan and kstack columns. Sampling the threads several times shows the same behavior. For a database like Cassandra (if configured properly) we should not observe many threads spending their time on I/O.

This leads us to suspect that the storage subsystem cannot keep up. The iostat output comes to the rescue (for the following two outputs I only have screenshots saved):

It takes 53 ms to write 4.77 MB of data! Moreover, avgqu-sz is low, so there is no queueing up. The next point of contact is the storage team.

Let’s also look at the bcc-tools output, just for diversity. The last column is the latency in milliseconds and the third column from the end is the amount of data (in bytes) written: 512 KB gets written in 0.4 seconds!
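As a sanity check on those numbers (taken from the output above), the implied throughput of that single write is easy to compute:

```python
# Quick arithmetic on the bcc-tools numbers quoted above:
# 512 KB written in 0.4 s is a throughput a healthy disk should beat easily.

bytes_written = 512 * 1024
latency_s = 0.4
throughput_mb_s = bytes_written / latency_s / (1024 * 1024)
print(throughput_mb_s)   # 1.25 MB/s for that single write
```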

The whole investigation journey could have been much shorter had we simply run iostat and analyzed the I/O latency. But that would be just a matter of luck. Even if the path is long, taking the right direction from the very beginning will definitely lead you to the cause.

That’s it for now! Hope you enjoyed it!

Six Ways to Force “Sequential Scan” in PostgreSQL

Here are six ways to force a query to use the sequential scan access method in PostgreSQL. The objective is to eliminate any index access in the query. The test table looks as follows.

postgres=# create table tablescandemo as select generate_series(1,100000) id;
SELECT 100000
postgres=# create index inddemo on tablescandemo(id);

Let’s get started.

Option I. Using the enable_indexscan, enable_indexonlyscan and enable_bitmapscan parameters.
Setting enable_indexscan to off disables the index scan access path. Let’s try it out.

postgres=# set enable_indexscan = off;
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                     QUERY PLAN                                                     
 Bitmap Heap Scan on tablescandemo  (cost=12.17..482.50 rows=500 width=4) (actual time=0.039..0.041 rows=1 loops=1)
   Recheck Cond: (id = 100)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on inddemo  (cost=0.00..12.04 rows=500 width=0) (actual time=0.026..0.027 rows=1 loops=1)
         Index Cond: (id = 100)
 Planning Time: 0.568 ms
 Execution Time: 0.091 ms
(7 rows)

The planner still accesses the index, but this time it behaves differently. The index is accessed in the same way as in the index scan method; however, instead of building a set of rowids of the heap rows, the database builds a bitmap: one bit per heap page. This is how the bitmap index scan works. The table is then scanned in physical order and the bitmap is used to decide whether a particular page must be probed or not. The enable_bitmapscan parameter controls the bitmap index scan access method:
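To make the mechanism concrete, here is a small Python sketch (purely illustrative, not PostgreSQL code) of the two phases: building a one-bit-per-page bitmap from the index, then probing only the marked heap pages:

```python
# Toy illustration of a bitmap index scan: the index pass marks one bit per
# heap page, then the heap pass visits only the marked pages in physical
# order and rechecks the predicate ("Recheck Cond" in the plan).

def bitmap_index_scan(index, predicate, tuples_per_page):
    """Return the set of heap page numbers containing candidate tuples."""
    pages = set()
    for key, tid in index:                       # tid = sequential tuple number
        if predicate(key):
            pages.add(tid // tuples_per_page)    # one bit per heap page
    return pages

def bitmap_heap_scan(heap_pages, page_bitmap, predicate):
    """Probe only the pages marked in the bitmap; recheck the predicate."""
    result = []
    for pageno in sorted(page_bitmap):           # pages visited in physical order
        for key in heap_pages[pageno]:
            if predicate(key):
                result.append(key)
    return result

# 100 tuples, 10 per page, index on the value itself
heap = [list(range(p * 10, p * 10 + 10)) for p in range(10)]
index = [(v, v) for v in range(100)]
bitmap = bitmap_index_scan(index, lambda k: k == 42, 10)
print(bitmap)                                             # {4}: one page to probe
print(bitmap_heap_scan(heap, bitmap, lambda k: k == 42))  # [42]
```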

postgres=# set enable_bitmapscan = off;
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                         QUERY PLAN                                                         
 Index Only Scan using inddemo on tablescandemo  (cost=0.29..4.31 rows=1 width=4) (actual time=0.070..0.071 rows=1 loops=1)
   Index Cond: (id = 100)
   Heap Fetches: 0
 Planning Time: 0.086 ms
 Execution Time: 0.094 ms
(5 rows)

Now, index only scan kicks in, since the table has a single column which is indexed. Let’s turn it off as well:

postgres=# set enable_indexonlyscan = off;
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                QUERY PLAN                                                
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=1 width=4) (actual time=0.021..9.114 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.065 ms
 Execution Time: 9.137 ms
(5 rows)

Basically, one needs to make sure that not just one but all index access methods (index scan, index only scan and bitmap index scan) are off.

Option II. Using the random_page_cost parameter.
This parameter tells the optimizer the cost of a randomly accessed disk page. It concerns both index and table pages. Therefore, the higher this value, the more an index access path costs in the optimizer's estimation, especially if the index is not very selective. Let’s see the experiment results (the index scan, index only scan and bitmap index scan control parameters are all back on now):

postgres=# set random_page_cost = 100000;
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                QUERY PLAN                                                 
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=1 width=4) (actual time=0.034..10.175 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.071 ms
 Execution Time: 10.200 ms
(5 rows)

The optimizer estimates that retrieving tuple(s) via the index will cost more than a sequential scan and hence prefers the latter.
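The trade-off can be sketched with a deliberately simplified cost model in Python; the parameter names mirror PostgreSQL's, but the formulas are rough assumptions, not the real costing code:

```python
# A simplified sketch of the planner's arithmetic: raising random_page_cost
# inflates the index path estimate until the sequential scan wins.

def seq_scan_cost(pages, tuples, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    # sequential scan: read every page sequentially, process every tuple
    return pages * seq_page_cost + tuples * cpu_tuple_cost

def index_scan_cost(index_pages, heap_fetches, random_page_cost=4.0):
    # every index and heap page touched is charged as a random read
    return (index_pages + heap_fetches) * random_page_cost

pages, tuples = 443, 100_000
print(seq_scan_cost(pages, tuples))                     # 1443.0
print(index_scan_cost(3, 1))                            # 16.0: index wins by default
print(index_scan_cost(3, 1, random_page_cost=100_000))  # 400000.0: seq scan wins
```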

Option III. Altering table statistics.
At the end of the day, the optimizer uses database statistics, which one can actually fake. The selectivity of the predicate is the primary factor in deciding which access path to take. It is based on the cardinality, which itself is calculated as the number of tuples that qualify the predicate. The latter is (assuming no extended statistics exist) nothing but N/NDV, where N is the number of tuples in the table and NDV is the number of distinct values of the column. The higher the NDV, the more selective the predicate, and thus the higher the probability that the optimizer leans towards index access rather than a sequential scan. Conversely, if NDV is low, the optimizer will think that most of the table tuples qualify, i.e. that index access will cost much more than a sequential scan, and will prefer the latter. Let’s check it out.
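The estimate described above is a one-liner; the numbers below match the test table (100,000 tuples):

```python
# The N/NDV estimate described above: with no extended statistics, the planner
# assumes an equality predicate matches N / NDV tuples.

def estimated_rows(n_tuples, n_distinct):
    return n_tuples / n_distinct

# honest statistics: the predicate looks very selective
print(estimated_rows(100_000, 100_000))   # 1.0 -> index access is attractive
# faked n_distinct = 1: "every tuple qualifies"
print(estimated_rows(100_000, 1))         # 100000.0 -> sequential scan wins
```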

postgres=# alter table tablescandemo alter column id set (n_distinct = 1);
postgres=# analyze tablescandemo;
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                  QUERY PLAN                                                   
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=100000 width=4) (actual time=0.015..6.366 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.098 ms
 Execution Time: 6.379 ms
(5 rows)

Option IV. ‘Hacking’ table statistics.
Even though the previous option is a valid one, it has two drawbacks: 1) the alter table command locks the table in exclusive mode and 2) analyzing the table may take a considerable amount of time, depending on its size. If that is a concern, you can update statistics directly in the pg_statistic table. Surely, this is not an accurate approach, but it still works and in extreme cases may come in handy:

postgres=# update pg_statistic set stadistinct = 1 where starelid = (select oid from pg_class where relname = 'tablescandemo');
postgres=# explain analyze select * from tablescandemo where id = 6;
                                             QUERY PLAN                                              
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=100000 width=4) (actual time=0.008..6.612 rows=1 loops=1)
   Filter: (id = 6)
   Rows Removed by Filter: 99999
 Planning Time: 0.115 ms
 Execution Time: 6.625 ms
(5 rows)


Option V. ‘Hacking’ table statistics (Part II).
Another statistic you can fake is the number of tuples a table contains (reltuples). The lower it is, the higher the probability that a sequential scan kicks in. This information is kept in the pg_class table and can be manipulated just like any other column value:

postgres=# update pg_class set reltuples=1 where relname = 'tablescandemo';
postgres=# explain analyze select * from tablescandemo where id = 6;
                                          QUERY PLAN                                           
 Seq Scan on tablescandemo  (cost=0.00..443.01 rows=1 width=4) (actual time=0.010..6.496 rows=1 loops=1)
   Filter: (id = 6)
   Rows Removed by Filter: 99999
 Planning Time: 0.138 ms
 Execution Time: 6.511 ms
(5 rows)

Option VI. Using hints.
PostgreSQL does not support hints out of the box. However, you can use the pg_hint_plan extension to hint queries. It provides hints for scan methods, join methods, join order, parallelism, row number correction and a few others.

Ohh…yes…and option VII: drop all indexes :)).

Parallel Query and %logons% Statistics

Here is a quick and interesting observation regarding the %logons% statistics. The database version is Oracle 19c and, apparently, this applies to earlier versions as well. The question I am trying to answer here is: “How does a parallel query execution relate to the %logon% statistics?”

Here is a test case.

Let’s check the statistics for the instance:

SQL> select name, value from v$sysstat where name like '%logon%';

NAME					          VALUE
-------------------------------- ----------
logons cumulative				  68368
logons current				      108
user logons cumulative            6762


Now, do nothing but log on to the database:

Connected to:
Oracle Database 19c Enterprise Edition Release - Production


Let’s check the statistics again:

SQL> select name, value from v$sysstat where name like '%logon%';

NAME					          VALUE
-------------------------------- ----------
logons cumulative				  68369
logons current				      109
user logons cumulative            6763


As a result, each of the statistics above gets incremented by one. So far, so good.

Now, from within that brand-new session, let’s execute a parallel query:

SQL> select /*+ parallel(8) */ count(*) from tt1;



Display the statistics to compare:

SQL> select name, value from v$sysstat where name like '%logon%';

NAME					          VALUE
-------------------------------- ----------
logons cumulative				  68377
logons current				      109
user logons cumulative            6763


As a result, the only statistic that changes is “logons cumulative”. It has grown by exactly the number of allocated parallel slaves: 8. Repeated execution of the same query with different degrees of parallelism increments only this statistic, each time by that degree.

The “logons current” statistic doesn’t change. It doesn’t include the slave sessions, only the coordinator. The coordinator SID is actually the SID of the session that executes the query. Another observation is that the “user logons cumulative” statistic does not include slave processes either.
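The bookkeeping observed above can be summarized in a small sketch (a hypothetical helper that merely restates the measured behavior, using the numbers from the outputs above):

```python
# The observed bookkeeping, as a sketch: each parallel execution bumps
# "logons cumulative" by the degree of parallelism, while "logons current"
# and "user logons cumulative" count only the coordinator session.

def logons_after(stats, dop, executions=1):
    out = dict(stats)
    out["logons cumulative"] += dop * executions   # slaves log on each run
    return out

before = {"logons cumulative": 68369, "logons current": 109,
          "user logons cumulative": 6763}
after = logons_after(before, dop=8)
print(after["logons cumulative"])   # 68377, matching the output above
```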

This may come in very handy when troubleshooting a logon storm caused by parallel execution.

Implicit Population of Materialized Views in In-Memory Column Store

The quickest and simplest way of verifying what has been populated in the In-Memory column store (IMCS) is to query the V$IM_SEGMENTS view. Whenever an object is loaded into IMCS, the database adds a record about it to this view. In general, I would guess the appearance of the IMCS is an attempt by Oracle to transition to the NewSQL stack.

According to the Oracle documentation, the IMCS can be enabled for a tablespace, a table (including external tables and excluding IOTs), a table partition and a materialized view (MV). More interestingly, it also mentions when, for example, a table gets populated in IMCS – only when it is fully scanned (and not in parallel). Index access methods that return table tuples, and fetches of table tuples by rowid, never load the tuples into IMCS.

The documentation does not say a word about when an MV gets populated in IMCS. It would be reasonable to assume that since an MV is very much like a heap table, the table IMCS population rules apply to it as well, i.e. for an MV to get populated in IMCS an explicit nonparallel full scan of it is necessary. This is indeed so. Additionally, implicit population of an MV is also possible, with just a little trick: a table access which causes a query rewrite (fully accessing the MV) can also trigger it. All this said, this post is not about something advanced but rather an interesting take-away about IMCS.

The test case is on Oracle 12c. First, let’s create a non-empty table and define an MV on it:

SQL> create table testtable1 as select object_id, object_name from dba_objects;

Table created.

SQL> create materialized view testview1 enable query rewrite as select * from testtable1;

Materialized view created.


Let’s now enable IMCS for the view and confirm that it really gets populated there when scanned fully:

SQL> alter materialized view testview1 inmemory;

Materialized view altered.

SQL> select count(*) from testview1;


SQL> select segment_name, populate_status from v$im_segments;

SEGMENT_NAME                        POPULATE_STATUS
----------------------------------- -------------
TESTVIEW1			                 COMPLETED

Now, let’s flush the MV from IMCS and then make it eligible for population again. Also, let’s query the table so that the query rewrite kicks in:

SQL> alter materialized view testview1 no inmemory;

Materialized view altered.

SQL> alter materialized view testview1 inmemory;

SQL> select segment_name, populate_status from v$im_segments;

no rows selected.


As the result from v$im_segments shows, the MV is not there…yet. Let’s cause a full scan of the MV again:

SQL> select count(*) from testview1;


SQL> select segment_name, populate_status from v$im_segments;

no rows selected.


Subsequent executions of the same query still don’t make the MV appear in IMCS, although the little trick will make it get populated:

SQL> alter table testtable1 inmemory;

Table altered.

SQL> select count(*) from testtable1;


SQL> select segment_name, populate_status from v$im_segments;

SEGMENT_NAME                        POPULATE_STATUS
----------------------------------- -------------
TESTVIEW1		                COMPLETED


The query plan now is as follows:

Plan hash value: 3466806115
| Id  | Operation                                  | Name      | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
|   0 | SELECT STATEMENT                           |           |     1 |     3   (0)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE                            |           |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR                           |           |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)                     | :TQ10000  |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE                         |           |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR                     |           | 73306 |     3   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       MAT_VIEW REWRITE ACCESS INMEMORY FULL| TESTVIEW1 | 73306 |     3   (0)| 00:00:01 |  Q1,00 | PCWP |            |

The next time the query executes, the table data will be retrieved from the MV, which is now in IMCS (the reason for going parallel is a bug described in note 2373556.1, which has a workaround). So, a query rewrite can cause MV population in IMCS, and for that to happen, the base table must have the INMEMORY option enabled.

That’s all for this time. More advanced stuff is coming soon…

Parallel Query and V$SQL

The following is a (short and quick) scenario I recently faced when analyzing a parallel query and its response time. This may not apply to earlier versions of Oracle though. The experiments were carried out in 19c.

The ‘weird’ behavior I observed was that the V$SQL.EXECUTIONS column didn’t get updated for the query under analysis when it was executed. The first gotcha is, perhaps, “That must be a bug”. Let’s be more systematic, however, and arrive at the root cause gradually.

Here is a test case. First, I am simply executing the query three times and checking the v$sql.

SQL> select sql_id, sql_text, executions, child_number from v$sql where sql_text like '%test1%';

SQL_ID	      SQL_TEXT                              EXECUTIONS CHILD_NUMBER
------------- ---------------------------------     ---------- ------------
4gy00kap2kf3n	select /*+ test1 */ * from table1   3	        0
4gy00kap2kf3n	select /*+ test1 */ * from table1   0	        1

2 rows selected.


The output shows that the database decided to generate another child (CHILD_NUMBER=1) for this statement. However, the primary question here is: “Why are there NO executions for the second child?”

The reason for not sharing, which can be found in v$sql_shared_cursor, is not of interest here, although it may shed some light on the question. Still, let’s take a different strategy without consulting this view.

Two approaches can be taken to move forward with this ‘issue’: a) try A, try B, try C,… and hope one of them eventually helps to find the reason, or b) ask the right questions. Before blaming the database for bugs, let’s put that aside as the last resort. There must be something about this that is not yet known at this point and that may lead to the clue.

The simplest question one can ask here is something like: “Is there any other statistic that applies to child cursors which has probably been missed?” This is a separate child and, thus, the entry point should be v$sql. Scanning the column names of the view for ‘%EXEC%’ reveals another field: PX_SERVERS_EXECUTIONS. Let’s query this column as well:

SQL> select sql_id, sql_text, executions, px_servers_executions, child_number from v$sql where sql_text like '%test1%';

SQL_ID        SQL_TEXT                           EXECUTIONS PX_SERVERS_EXECUTIONS CHILD_NUMBER
------------- ---------------------------------- ---------- --------------------- ------------

4gy00kap2kf3n select /*+ test1 */ * from table1  3			 0	                   0
4gy00kap2kf3n select /*+ test1 */ * from table1  0			 12	                   1

2 rows selected.


PX_SERVERS_EXECUTIONS holds the total number of executions by the slaves of a parallel query. This explains why the second cursor’s executions are not reflected in the V$SQL.EXECUTIONS column: for parallel slaves, Oracle tracks them in a different column, V$SQL.PX_SERVERS_EXECUTIONS, and in this case it has been executed 12 times so far. Since the query does not have any parallelism forced by the hint, table-level parallelism must be set, which is confirmed by dba_tables.degree being equal to 4. The one with CHILD_NUMBER=0 is actually the coordinator. The query executes in parallel:

SQL> select * from dbms_xplan.display_cursor('4gy00kap2kf3n', 0, 'ALLSTATS ALL');

SQL_ID	4gy00kap2kf3n, child number 0
select /*+ test1 */ * from table1

Plan hash value: 1574891232
| Id  | Operation            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
|   0 | SELECT STATEMENT     |          | 73216 |  9438K|   110   (0)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)| :TQ10000 | 73216 |  9438K|   110   (0)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX BLOCK ITERATOR |          | 73216 |  9438K|   110   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   4 |     TABLE ACCESS FULL| TABLE1   | 73216 |  9438K|   110   (0)| 00:00:01 |  Q1,00 | PCWP |            |
Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (E - Syntax error (1))
   1 -  SEL$1
   E -  test1
   - Degree of Parallelism is 4 because of table property

28 rows selected.


Moral: a step-by-step approach is the key, regardless of the issue and the DBMS.

A potential method for determining the join method without explain plan

I will not pretend that the idea presented in this post addresses a wide range of troubleshooting scenarios, but in some cases it may come in handy. For example, imagine a situation where you cannot log in to the database because a long-running query makes it unresponsive, while the OS is running fine and can be worked with.

The simple and not-yet-so-mature approach presented here should help one reason about the join method a query is likely performing. At least, it may give certain (though incomplete) grounds to reason about it.

All other conditions being equal, the Nested Loop join method is the one that executes the highest number of iterations to fetch the requested tuples. The other two methods, Hash Join and Sort-Merge Join, face fewer iterations and, perhaps, fewer loops.
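As a rough sketch (assuming unindexed inputs of n and m tuples and ignoring constant factors), the iteration counts compare as follows:

```python
# Back-of-the-envelope iteration counts for the three join methods; a
# deliberate simplification, not a real cost model.

import math

def nested_loop_iters(n, m):
    return n * m              # every outer tuple probes every inner tuple

def hash_join_iters(n, m):
    return n + m              # build one side, probe with the other

def merge_join_iters(n, m):
    # dominated by sorting both inputs, then a single merge pass
    return int(n * math.log2(n) + m * math.log2(m) + n + m)

n, m = 100_000, 100_000
print(nested_loop_iters(n, m))   # 10,000,000,000 iterations
print(hash_join_iters(n, m))     # 200,000 iterations
print(merge_join_iters(n, m))    # a few million iterations
```

Five orders of magnitude between the nested loop and the other two is what makes the branch-counting idea below plausible.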

The scenario with the assumptions above is not a rare one. However, the question may be: “Once you start suspecting that it is most probably the Nested Loop join that’s causing the hang, how can that help?” At the very least, one can tell the software developers to fix the query by changing the join method, which should solve the issue. Since (as assumed) the database is unresponsive, one cannot peek at the explain plan and tell for sure. What can help, though, is “expressing the join method via loops”.

To do this, let’s step aside a little.

When the compiler converts a loop in high-level code (while/for) into assembly, it maps it to a branch instruction. The branch instruction updates the contents of the Program Counter (PC; also known as the Extended Instruction Pointer in x86) register with the address stored as the Branch Target Address (BTA). On the next cycle the CPU fetches the instruction the PC points to (the target of the branch) and executes it. To illustrate, here is an assembly example for ARM processors:

.global main

        mov     r0, #0     /* setting up initial variable a */
loop:   cmp     r0, #4     /* checking if a==4 */
        beq     end        /* proceeding to the end if a==4 */
        add     r0, r0, #1 /* increasing a by 1 if the jump to the end did not occur */
        b       loop       /* repeating the loop */
end:    bx      lr         /* THE END */

Here, b is a branch instruction with the label loop. The label loop points to the cmp (compare) instruction. When the CPU reaches the b loop instruction, the PC’s contents are updated with the value of the loop label, which is nothing but the address of the cmp instruction. On the next cycle, when the PC is fetched, the address of cmp is returned, and the instruction is fetched by its address and executed. In this way, the branch instruction “makes a jump”. This is how a loop iterates.
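To make the jump mechanics concrete, here is a toy interpreter (purely illustrative, not real CPU code) for the snippet above; it counts how many branch instructions execute while the loop runs:

```python
# A toy interpreter for the ARM snippet above: a branch instruction simply
# overwrites the program counter (pc), and the loop is nothing but that
# overwrite happening again and again.

def run(program):
    r0, pc, branches, eq = 0, 0, 0, False
    while pc < len(program):
        op, arg = program[pc]
        if op == "mov":                 # mov r0, #arg
            r0, pc = arg, pc + 1
        elif op == "cmp":               # cmp r0, #arg: set the equality flag
            eq, pc = (r0 == arg), pc + 1
        elif op == "beq":               # branch if equal: rewrite pc if taken
            branches += 1
            pc = arg if eq else pc + 1
        elif op == "add":               # add r0, r0, #arg
            r0, pc = r0 + arg, pc + 1
        elif op == "b":                 # unconditional branch: always rewrite pc
            branches += 1
            pc = arg
    return r0, branches

# mov r0,#0 / loop: cmp r0,#4 / beq end / add r0,r0,#1 / b loop / end:
prog = [("mov", 0), ("cmp", 4), ("beq", 5), ("add", 1), ("b", 1)]
print(run(prog))   # (4, 9): r0 ends at 4 after 9 executed branch instructions
```

Even this four-iteration loop executes nine branch instructions; a nested loop join doing billions of iterations executes branches at the same multiple.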

Thus, a high number of branch instructions can be an indicator of excessive iteration, in particular of a Nested Loop join.

The next step is finding an OS statistic that shows the number of branch instructions executed in real time. This is where the perf tool comes into play.

The perf tool is a performance profiler that can display various hardware and software statistics and counters. It can also be attached to a running process. For example, as follows:

# perf stat -p 4539
       1920.159821 task-clock                #    0.991 CPUs utilized          
                13 context-switches          #    0.007 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
               258 page-faults               #    0.134 K/sec                  
     5,649,595,479 cycles                    #    2.942 GHz                     [83.43%]
     1,808,339,931 stalled-cycles-frontend   #   32.01% frontend cycles idle    [83.54%]
     1,171,884,577 stalled-cycles-backend    #   20.74% backend  cycles idle    [66.77%]
     8,625,207,199 instructions              #    1.53  insns per cycle        
                                             #    0.21  stalled cycles per insn [83.51%]
     1,488,797,176 branches                  #  775.351 M/sec                   [82.58%]
        53,395,139 branch-misses             #    3.59% of all branches         [83.78%]

       1.936842598 seconds time elapsed

The command accumulates statistics until cancelled (CTRL+C). So, the perf output shows that by that time 1,488,797,176 branch instructions had been executed, of which 53,395,139 the branch predictor failed to predict. The former is actually the main metric of interest. If large deltas are observed in the branches metric, the code is executing a high number of iterations. This, in turn, can direct us to suspect that (probably) a Nested Loop join is running intensively. Sampling this metric over time and seeing that the number is unreasonably high can give a tip that the code needs to be checked for a loop iterating over and over again.
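As a sketch of the sampling idea, one can diff the branches counter between two perf stat outputs. The parser below assumes the textual layout shown above, and the second sample is made up for illustration:

```python
# Compute the delta of the `branches` counter between two perf stat samples.
# The first sample is taken from the output above; the second is invented.

import re

def branches(perf_output):
    """Extract the branches counter from perf stat text output."""
    m = re.search(r"([\d,]+)\s+branches", perf_output)
    return int(m.group(1).replace(",", ""))

sample1 = "     1,488,797,176 branches                  #  775.351 M/sec"
sample2 = "     2,990,114,502 branches                  #  774.902 M/sec"
delta = branches(sample2) - branches(sample1)
print(delta)   # 1501317326 branches executed between the two samples
```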

Side note: the Nested Loop join is not the only code path that can cause this. It can also be, for example, latch contention, since retry logic is an essential part of pretty much any latch implementation. Another example is an index scan, where the code iteratively walks through the index to find the key, which (by the way) can itself be part of a Nested Loop join. The approach presented should be applicable to any RDBMS.

In the next post, I will reproduce a highly iterative Nested Loop join and will analyze the metric mentioned. This will give a better insight into improving the idea briefly presented.


When “INDEX RANGE SCAN” Is Actually “INDEX UNIQUE SCAN”

Oracle’s optimizer can sometimes trick you, and this post is about a finding that clearly demonstrates it. Putting it shortly, while the execution plan shows that the optimizer is doing an “INDEX RANGE SCAN”, in actuality it is going with an “INDEX UNIQUE SCAN”. Here is a test case in Oracle 19c.

First, let’s create a table, populate it with data and add an index on its column.

SQL> create table table1(id integer);

Table created.

SQL> insert into table1 select object_id from dba_objects;

7433 rows created.

SQL> create index IND on table1(id);

Index created.

Let’s confirm the type of the index just created, which will be needed a little later:

SQL> select UNIQUENESS from dba_indexes where index_name = 'IND';

UNIQUENESS
----------
NONUNIQUE

Since the index is nonunique, the optimizer will (or should!?) never go with an “INDEX UNIQUE SCAN”.

Let’s now add a primary key.

SQL> alter table table1 add primary key(id);

Table altered.

Since an index on column ID already exists, Oracle doesn’t attempt to create a new one; it links the constraint to the existing index. We can confirm this by querying the DBA_CONSTRAINTS view. Additionally, the database does not update the index type in the data dictionary.

SQL> select constraint_name, index_name from dba_constraints where table_name = 'TABLE1';

CONSTRAINT_NAME                INDEX_NAME
------------------------------ ------------------------------
SYS_C008559                    IND

SQL> select UNIQUENESS from dba_indexes where index_name = 'IND';

UNIQUENESS
----------
NONUNIQUE

What we have so far is a table with a primary key constraint enforced via an index which Oracle reports as NONUNIQUE (which effectively is not so).

It is interesting to see which index access path the optimizer takes for a query with an equality predicate.

select * from table1 where id = 2;
Plan Table
| Id  | Operation         | Name    | Rows  | Bytes | Cost  | Time      |
| 0   | SELECT STATEMENT  |         |       |       |     1 |           |
| 1   |  INDEX RANGE SCAN | IND     |     1 |    13 |     0 |           |

The optimizer reports an “INDEX RANGE SCAN”. Given the type of the index in the data dictionary, this is expected. However, we definitely know that the indexed values are unique, since that is what the primary key constraint guarantees, and a unique lookup should not be causing range scans.

Let’s see what session statistics related to indexes the database reports.

select s.name, value from v$mystat m, v$statname s where s.STATISTIC# = m.STATISTIC#
and s.name like '%index%' 
and value != 0;

NAME                        VALUE
--------------------------- ----------
index range scans           1

So, the statistics also state that the optimizer chose the “INDEX RANGE SCAN” access method (this statistic is missing in the Database Reference 19c).

Finally, let’s take a look at the optimizer trace (the Access path analysis section only) to see what really happens behind the scenes.

 Access Path: TableScan
    Cost:  2.000631  Resp: 2.000631  Degree: 0
      Cost_io: 2.000000  Cost_cpu: 23521
      Resp_io: 2.000000  Resp_cpu: 23521
  Access Path: index (index (FFS))
    Index: IND
    resc_io: 2.000000  resc_cpu: 0
    ix_sel: 0.000000  ix_sel_with_filters: 1.000000
  Access Path: index (FFS)
    Cost:  2.000000  Resp: 2.000000  Degree: 1
      Cost_io: 2.000000  Cost_cpu: 0
      Resp_io: 2.000000  Resp_cpu: 0
 ****** Costing Index IND
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Access Path: index (UniqueScan)
    Index: IND
    resc_io: 0.000000  resc_cpu: 1050
    ix_sel: 0.004000  ix_sel_with_filters: 0.004000
    Cost: 0.000028  Resp: 0.000028  Degree: 1
 ****** Costing Index IND
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Access Path: index (AllEqUnique)
    Index: IND
    resc_io: 0.000000  resc_cpu: 1050
    ix_sel: 0.004000  ix_sel_with_filters: 0.004000
    Cost: 0.000028  Resp: 0.000028  Degree: 1
 One row Card: 1.000000
  Best:: AccessPath: IndexUnique
  Index: IND
         Cost: 0.000028  Degree: 1  Resp: 0.000028  Card: 1.000000  Bytes: 0.000000

Voila! In actuality, no “INDEX RANGE SCAN” happens; an “INDEX UNIQUE SCAN” does. This is confirmed by the access path the optimizer finally settles on: Best:: AccessPath: IndexUnique. The “INDEX RANGE SCAN” access path is not even considered as an option. Apparently, the session statistic is derived from what the optimizer displays in the execution plan.

In summary, even though the optimizer reports the “INDEX RANGE SCAN” access method, it may actually be performing an “INDEX UNIQUE SCAN”. The execution plan is not the source of truth in such a case.
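
Since the 10053 trace, not the plan, records the real decision, it can be grepped for the final choice programmatically. Below is a minimal sketch, assuming the trace text has already been read from the trace file (the function name is my own, not an Oracle utility):

```python
import re

def best_access_path(trace_text):
    """Extract the optimizer's final access-path choice from a 10053 trace.

    Looks for the 'Best:: AccessPath:' line, which records what the
    optimizer actually selected, regardless of what the execution plan
    later displays.
    """
    m = re.search(r"Best::\s*AccessPath:\s*(\w+)", trace_text)
    return m.group(1) if m else None

trace = """
  Access Path: index (AllEqUnique)
    Index: IND
 One row Card: 1.000000
  Best:: AccessPath: IndexUnique
  Index: IND
"""
print(best_access_path(trace))  # IndexUnique
```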

The “force merge” process marks indexes searchable in Elasticsearch

It is well known that once Elasticsearch (ES) indexes a document, the document does not become searchable right away. For the index to become searchable, a refresh must run, which internally copies the documents to a new segment in the OS cache and opens it for searches. Once the index has been refreshed, all documents within it become searchable.

According to the ES documentation, the refresh process runs every second by default. Alternatively, one can run it manually. It can also be disabled entirely by setting the index’s “refresh_interval” setting to -1.

A simple experiment reveals that there is another, undocumented, case that can make an index searchable: the force merge.

This process runs in the background and periodically merges segments into larger ones. It is necessary to handle updated and/or deleted documents, since ES does not implement these operations in place. For example, when a document is updated, ES creates a new document with a higher version number and keeps the old one until the next merge, which, out of the several versions of the same document, retains only the one with the highest version number.
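
The version-keeping behavior of a merge can be illustrated with a small simulation. This is not ES code, just a sketch of the rule described above: when segments are merged, only the highest-versioned copy of each document survives.

```python
def merge_segments(*segments):
    """Simulate how a merge collapses multiple versions of a document.

    Each segment is a list of (doc_id, version, source) tuples. For every
    doc_id, only the entry with the highest version survives the merge;
    older versions (overwritten or deleted documents) are dropped.
    """
    latest = {}
    for segment in segments:
        for doc_id, version, source in segment:
            if doc_id not in latest or version > latest[doc_id][0]:
                latest[doc_id] = (version, source)
    return {doc_id: source for doc_id, (_, source) in latest.items()}

seg1 = [("1", 1, {"name": "kimchy1"})]
seg2 = [("1", 2, {"name": "kimchy1-updated"}), ("2", 1, {"name": "kimchy2"})]
print(merge_segments(seg1, seg2))
# {'1': {'name': 'kimchy1-updated'}, '2': {'name': 'kimchy2'}}
```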

Let’s put up an experiment and check out how the merge process affects searchability of the index.

First, let’s add an index:

curl -X PUT "localhost:9200/index_custom?pretty"

Let’s then make sure that the refresh is disabled by setting the value of “refresh_interval” to -1:

curl -X PUT "localhost:9200/index_custom/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval": "-1"
  }
}'

Now, let’s add a couple of documents to the index (with ids 1 and 2 respectively):

curl -X PUT "localhost:9200/index_custom/_doc/1?timeout=5m&pretty" -H 'Content-Type: application/json' -d'
{
    "id": "kimchy",
    "name": "kimchy1"
}'

curl -X PUT "localhost:9200/index_custom/_doc/2?timeout=5m&pretty" -H 'Content-Type: application/json' -d'
{
    "id": "kimchy",
    "name": "kimchy2"
}'

Let’s now flush it to disk to ensure that segments have been created:

curl -X POST "localhost:9200/index_custom/_flush?pretty"

Confirm the number of documents in the index (docs.count):

curl 'localhost:9200/_cat/indices?v'
health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index_custom       jqTFVQm0R_u9bL2kIGQrUA   1   1          2            0      4.8kb           225b

Let’s now search a document

curl -X GET "localhost:9200/index_custom/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
        "term" : { "id" : "kimchy" }
  }
}'

and check the response

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

As expected, no document is returned since the index has not been refreshed yet.

Let’s confirm it by checking the statuses of the segments of the index. Notice that the value of the “searchable” column is false for both of the segments.

curl -X GET "localhost:9200/_cat/segments?v=true&pretty"
index              shard prirep ip          segment generation docs.count docs.deleted     size size.memory committed searchable version compound

index_custom       0     r      localhost   _0               0          2            0    4.3kb           0 true      false      9.0.0   true
index_custom       0     p      localhost   _0               0          2            0    4.3kb           0 true      false      9.0.0   true
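
When many indexes are involved, eyeballing the “searchable” column gets tedious. A small parser over the `_cat/segments?v=true` output can flag unsearchable segments; this is a sketch that assumes fully populated, whitespace-separated rows (the sample below is trimmed to the relevant columns):

```python
def searchable_segments(cat_output):
    """Parse `_cat/segments?v=true` output into (segment, searchable) pairs.

    Uses the header line to locate the 'segment' and 'searchable' columns,
    then reports whether each segment is currently searchable.
    """
    lines = [l for l in cat_output.splitlines() if l.strip()]
    header = lines[0].split()
    seg_col = header.index("segment")
    search_col = header.index("searchable")
    result = []
    for row in lines[1:]:
        cols = row.split()
        result.append((cols[seg_col], cols[search_col] == "true"))
    return result

sample = """\
index        shard prirep ip        segment searchable
index_custom 0     p      127.0.0.1 _0      false
index_custom 0     r      127.0.0.1 _0      false"""
print(searchable_segments(sample))  # [('_0', False), ('_0', False)]
```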

Finally, let’s manually invoke the merge process

curl -X POST "localhost:9200/index_custom/_forcemerge?pretty"

and rerun the query. As a result, the documents are returned, which means that the index has become searchable:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "index_custom",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "id" : "kimchy",
          "name" : "kimchy1"
        }
      },
      {
        "_index" : "index_custom",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "id" : "kimchy",
          "name" : "kimchy2"
        }
      }
    ]
  }
}

This can also be confirmed by double-checking the value of the “searchable” column in the list of segments.

Conclusion: The refresh process is not the only one that makes an index searchable. The index can also be made searchable by force merging. This is not documented.

Deciphering PostgreSQL process swap space utilization

One of the questions that a colleague of mine recently asked was: “How can I find out how much swap space a PostgreSQL process is really using?”. This triggered a piece of curiosity which I will talk about in this post.

At first, it might seem like we can use the Linux top command which has memory related columns. Let’s run it for an arbitrary process and see the output:

[root@localhost ~]# top -p 3748

top - 00:42:16 up 3 days,  7:43,  5 users,  load average: 0.00, 0.01, 0.05
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.1 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16266588 total, 15392968 free,   233036 used,   640584 buff/cache
KiB Swap:  1679356 total,  1615172 free,    64184 used. 15566964 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND               SWAP
3748 postgres  20   0 8851352   4520   3344 S   0.0  0.0   0:00.01 postmaster             972

The top command reports that the process (pid 3748; see the SWAP column in the last line) is utilizing 972 KB of swap space. There are three questions here:
1) Is this really swap and the OS does not fake it?
2) If it is, then how accurate is this number?
3) Why does the OS allocate swap to this process while there is free RAM?

Here, I will try to answer the first two questions (third one is for another post).

According to the Linux top manual, the reported SWAP column can include any of the quadrants #1 through #3, each of which covers memory pages allocated via certain functions. For example, quadrant #1 includes the process stack, and this does not actually tell us anything about the pages being swapped out. It only tells us that the memory pages allocated for the process stack are eligible to be swapped in/out.

So, how can we find out then how much of the process address space is mapped and really is on the swap file?

To answer this question, let’s get back to some fundamentals.

A Linux process is provided with a virtual address range. A page for an address in this range can be allocated either from RAM or from secondary storage (HDD, SSD, etc.). The process page table maintains the mapping from a virtual address to a physical address. Moreover, it also holds control bits, each of which tracks some information about the page: e.g., the dirty bit indicates whether the page’s content is in sync with the disk, the unevictable bit indicates that the page cannot be evicted from the LRU list, and so on. This information can be queried from the /proc/kpageflags file, which in the latest kernels holds 26 flags.

Two of the flags are related to swapping, of which swapcache is the one we need. This bit indicates whether the page table entry (PTE) has an associated swap entry.

Let’s now see how we can calculate the total swap space occupied by a process.

The environment for the experiments is PostgreSQL 14 running on CentOS 7.7. Let’s first connect to the database and retrieve its OS PID to have a process to work with:

-bash-4.2$ psql postgres
psql (14.2)
Type "help" for help.

postgres=# select pg_backend_pid();
 pg_backend_pid 
----------------
          16460
(1 row)

The next step is to get the swapcache bit value for every page in every PTE of this process. This information is available in /proc/$PID/pagemap, from which we can find out which physical frame a virtual page is mapped to (the process page table) along with flag values (this file must be parsed to be human-readable; I used the parser available here). For the sake of demonstration, I will show just a single range of virtual addresses, which can be obtained from /proc/$PID/maps.
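
The layout of a /proc/$PID/pagemap entry is documented in the kernel source (admin-guide/mm/pagemap.rst), so decoding one does not strictly require an external parser. Below is a minimal sketch of such a decoder; the 4 KiB page size is an assumption, and the helper names are my own:

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def pagemap_offset(vaddr):
    """Byte offset into /proc/<pid>/pagemap of the entry for vaddr.

    Each virtual page has one 8-byte entry, in page order.
    """
    return (vaddr // PAGE_SIZE) * 8

def decode_pagemap_entry(entry):
    """Decode one 64-bit pagemap entry.

    Bit layout (Documentation/admin-guide/mm/pagemap.rst):
      bit 63     - page present in RAM
      bit 62     - page swapped out
      bit 61     - file-mapped or shared-anonymous page
      bit 55     - PTE is soft-dirty
      bits 0-54  - PFN if present; swap type/offset if swapped
    """
    present = bool(entry >> 63 & 1)
    return {
        "present": present,
        "swapped": bool(entry >> 62 & 1),
        "file_shared": bool(entry >> 61 & 1),
        "soft_dirty": bool(entry >> 55 & 1),
        "pfn": entry & ((1 << 55) - 1) if present else 0,
    }

# A synthetic entry matching the first pagemap line shown below:
# present, soft-dirty, file/shared, pfn 0x3e2942, not swapped.
entry = (1 << 63) | (1 << 61) | (1 << 55) | 0x3e2942
print(decode_pagemap_entry(entry))
```

Note that reading actual PFN values from pagemap requires CAP_SYS_ADMIN on modern kernels (they are reported as zero otherwise), while the present/swapped bits remain readable for your own processes.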

The file holds all the virtual address ranges of a process, including the library associated with each range, its permissions, etc.:

[root@localhost ~]# tail /proc/16460/maps
7f5a72e70000-7f5a72e77000 rw-s 00000000 00:13 246403                     /dev/shm/PostgreSQL.1005331588
7f5a72e77000-7f5a72e91000 rw-p 00000000 00:00 0
7f5a72e96000-7f5a72e97000 rw-s 00000000 00:04 327680                     /SYSV01008d37 (deleted)
7f5a72e97000-7f5a72e98000 rw-p 00000000 00:00 0
7f5a72e98000-7f5a72e99000 r--p 00021000 fd:00 35678                      /usr/lib64/ld-2.17.so;623f47b9 (deleted)
7f5a72e99000-7f5a72e9a000 rw-p 00022000 fd:00 35678                      /usr/lib64/ld-2.17.so;623f47b9 (deleted)
7f5a72e9a000-7f5a72e9b000 rw-p 00000000 00:00 0
7fff241ac000-7fff241cd000 rw-p 00000000 00:00 0                          [stack]
7fff241de000-7fff241e0000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Let’s take one of these ranges and pass it to the pagemap executable:

[root@localhost pagemap-master]# ./pagemap 16460 0x7f585720f000 0x7f585721b000
0x7f585720f000     : pfn 3e2942           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857210000     : pfn 3e2943           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857211000     : pfn 41c95c           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857212000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857213000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857214000     : pfn 41c95f           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857215000     : pfn 3ff0e8           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857216000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857217000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857218000     : pfn 3ff0eb           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857219000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f585721a000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0

As an example, consider the first line of the output. The first value is the virtual address 0x7f585720f000, followed by the pfn 3e2942 (page frame number, the actual physical address of the frame in hexadecimal), followed by attribute names and their values. For this address we see that the swapped bit is 0. This is what we were looking for: for any page in the process page table, we can get the bit value that tells us whether the page is associated with a swap entry.

However, the mere fact that a page is mapped to a swap entry does not yet mean that it has been swapped out. We also have to make sure that the associated page in the swap space is actually in use. This can be determined by looking at the locked bit in /proc/kpageflags, which is set when a page has been exclusively locked by the process (e.g., when doing I/O). As seen from the last output, the pagemap parser does not report this bit, but it can be added. Moreover, the parser also ships with another executable (pagemap2) which prints all pages of a process.

To sum up, the actual swap space used by a process can be calculated as the number of pages with both the swapped and locked bits set, multiplied by the page size. I will not pretend that I have written a script that calculates this for a running process, but making one seems like a good idea that could help troubleshooting in certain memory-related scenarios.
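
As a cross-check, the kernel already publishes its own per-mapping swap accounting in /proc/$PID/smaps (the “Swap:” line of each mapping; the per-process total also appears as VmSwap in /proc/$PID/status). This will not replace the page-level analysis above, but summing those lines gives a quick baseline figure. A minimal sketch (the sample values in the test are synthetic):

```python
def swap_kb_from_smaps(smaps_text):
    """Sum the per-mapping 'Swap:' lines from /proc/<pid>/smaps.

    For each mapping, the kernel reports how many kB of it currently
    reside in swap; the sum over all mappings is the kernel's own answer
    to how much swap the process is using.
    """
    total_kb = 0
    for line in smaps_text.splitlines():
        if line.startswith("Swap:"):
            total_kb += int(line.split()[1])  # second field is the kB value
    return total_kb

# Usage against a live process (requires appropriate permissions):
# with open("/proc/16460/smaps") as f:
#     print(swap_kb_from_smaps(f.read()), "kB")
```

Comparing this number against the SWAP column of top for the same process is a good way to see how much the quadrant-based estimate overstates real swap usage.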

Note: Although the motivation originated from a PostgreSQL-related problem, the approach is valid for an arbitrary user process.