Diving deeper into the LWLock in PostgreSQL.

Lightweight locks (LWLocks) are a type of interprocess lock used by PostgreSQL to coordinate simultaneous access of multiple processes to a shared memory region. There are various reasons that can cause this lock to create issues in the database. I will now provide a short yet concise demo, followed by an exploration of the nature of LWLock in PostgreSQL through reference to its source code and OS stuff.

As a first step, I will generate some load on the database using pgbench. While not extremely flexible, this tool is adequate for testing some basics. The provided inputs are as follows.

[postgres@centos8 ~]$ pgbench -c 10 -j 10 -t 100000 -S tpcc
pgbench (15.5)
starting vacuum...end.

It spawns 10 (-j) worker threads to execute 10 (-c) concurrent processes, each of which runs 100,000 read-only queries as fast as possible (this is the default, and the rate is configurable as well). Visually, it appears as follows:

This is certainly not the behavior one would expect from a well-performing database with reasonable response times. The dark read LWLock prevails and we certainly need to bring it down. Typically, the LWLock is accompanied by I/O, as is the case here. However, this is not always the scenario, as a few other reasons may exist. Let’s assume we have no information about the database and try to figure out what is going on.

PostgreSQL introduced the pgsentinel extension, which, when loaded, samples the active sessions every second by default and makes the information available through the pg_active_session_history view. This allows querying what the sessions were doing at a particular point in time or within a range, making it very useful when diagnosing performance issues:

[postgres@centos8 home]$ ./ash wait_event_type,wait_event "backend_type='client backend'" '2023-11-24 20:29:00' '2023-11-24 20:30:00'
 wait_event_type |    wait_event    | samples | sessions 
 CPU             | CPU              |     213 |       11
 LWLock          | BufferMapping    |     196 |       10
 IO              | DataFileRead     |      61 |       10
 IPC             | BufferIO         |      50 |       10
 IPC             | ExecuteGather    |       7 |        7
 IPC             | BgWorkerShutdown |       4 |        4
(6 rows)

This is a bash script which connects to the database and queries data from pg_active_session_history view. The first list (wait_event_type, wait_event) is the grouping columns followed by the filter, and followed by the range in which we are interested.

From the output (as well as the chart), it becomes evident that the highlighted LWLock is the culprit, with approximately 10 sessions waiting for BufferMapping. This wait event occurs when a process is waiting to map a page from the disk to a buffer in the shared buffers. Once the necessary disk page has been identified, the process must find a free buffer in the buffer pool to load the page into. This is precisely what the processes are currently awaiting. Before we proceed to rerun the script and examine additional dimensions that may be helpful, let’s draw reasonable few conclusions here.

If the process cannot find a free buffer in the buffer pool, then the most likely suspect is that the “Shared buffers is full“. The (misleading) path to take then would (or could) be something like “Well, if it is full, then let’s add up some space.” Although a feasible yet rare solution in practice, the correct approach to handling it would be to ask, ‘What is polluting it?”. Well…definitely, the queries :)). Let’s then find out what these queries look like (only top 20 rows are returned, by default).

[postgres@centos8 home]$ ./ash wait_event_type,wait_event,query "backend_type='client backend'" '2023-11-24 20:29:00' '2023-11-24 20:30:00'
 wait_event_type |  wait_event   |                           query                           | samples | sessions 
 LWLock          | BufferMapping | SELECT abalance FROM pgbench_accounts WHERE aid = 1679538 |       5 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 6508293 |       4 |        1
 LWLock          | BufferMapping | SELECT abalance FROM pgbench_accounts WHERE aid = 649929  |       4 |        1
 LWLock          | BufferMapping | SELECT abalance FROM pgbench_accounts WHERE aid = 6487070 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 5088721 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9953261 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1659220 |       4 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9321817 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9104089 |       3 |        1
 IO              | DataFileRead  | SELECT abalance FROM pgbench_accounts WHERE aid = 8129419 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1997345 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1172966 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 513858  |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 9534579 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 1393497 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 3166138 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 3735733 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 2971116 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 4772208 |       3 |        1
 CPU             | CPU           | SELECT abalance FROM pgbench_accounts WHERE aid = 2960432 |       3 |        1
(20 rows)

All sessions are executing the same query but passing different literals. They look very innocent: a small point query which (apparently; because the column “aid” looks like some sort of ID) returns just a single row). Remember, we have no information about the database – nothing at all. That is, we cannot allow some names, IDs, and etc. put us on the wrong path.

We now know what is polluting the buffer pool. So, we have to see why this query polluting it. The query explain plan will show us how the details of the query execution:

tpcc=# explain (analyze, buffers) SELECT abalance FROM pgbench_accounts WHERE aid = 1679538;
                                                             QUERY PLAN                                                             
 Gather  (cost=1000.00..217018.43 rows=1 width=4) (actual time=2434.719..2434.813 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=12678 read=151257
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..216018.33 rows=1 width=4) (actual time=2166.930..2423.078 rows=0 loops=3)
         Filter: (aid = 1679538)
         Rows Removed by Filter: 3333333
         Buffers: shared hit=12678 read=151257
   Buffers: shared hit=41 read=1
 Planning Time: 6.413 ms
 Execution Time: 2434.930 ms
(12 rows)

The query reads 151257 pages from disk to return a single row (rows=1). Luckily, 12678 are found in the shared buffer but still, 1.5k pages is way too high to spit out a single row. The other interesting aspect here is the second highlighted line: “Rows Removed by Filter: 3333333“. In actuality, the database didn’t remove this number of rows by filter. It is much more than that: 9999999. The explanation is in another LinkedIn post of mine :)).

So, the query reads lots of pages, out of which only one holds the necessary row. This is what actually pollutes the buffer. Adding an index on the filtering column will definitely make it much more better. As a result, the query will read only two pages: one from the index, another one from the table (unless the table/index is bloated). This is how things got better after the column has been indexed:

[postgres@centos8 home]$ ./ash wait_event_type,wait_event "1=1" '2023-11-24 22:00:00' '2023-11-24 22:01:00'
 wait_event_type |  wait_event  | samples | sessions 
 Client          | ClientRead   |      36 |       10
 CPU             | CPU          |      16 |        8
 IO              | DataFileRead |      16 |        8
(3 rows)

LWLock is gone and not there anymore.

Now that it is clear how to deal with LWLock (at least in a typical case) related waits, it is also important to understand the implementation details of it. This time, we’ll delve into the details of how it is acquired (the release should be much simpler), and what actually counts as wait duration when the lock is being waited for. The current wait interface in PostgreSQL does not expose the duration of the wait. One of the first valid approaches is to examine the functions that execute when the process attempts to acquire the lock.

Let’s sample the stack trace with pstack and see what is going on (I’ve sampled it many times in a loop for a single process but will show only the top one here.):

[root@centos8 ~]# pstack 4391
#0  0x00007fd95fd9cda6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fd95fd9ce98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00000000007815a2 in PGSemaphoreLock (sema=0x7fd819711eb8) at pg_sema.c:327
#3  0x00000000007fca6c in LWLockAcquire (lock=lock@entry=0x7fd819715580, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1315
#4  0x00000000007dbf8f in BufferAlloc (foundPtr=0x7fff9354e59b, strategy=0x1c7da98, blockNum=134785, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=0x1c3d8f8) at bufmgr.c:1320
#5  ReadBuffer_common (smgr=0x1c3d8f8, relpersistence=<optimized out>, forkNum=MAIN_FORKNUM, blockNum=134785, mode=RBM_NORMAL, strategy=0x1c7da98, hit=0x7fff9354e647) at bufmgr.c:873
#6  0x00000000007dc6f0 in ReadBufferExtended (reln=0x7fd96169d1f8, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=134785, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>) at ../../../../src/include/utils/rel.h:560
#7  0x0000000000549c66 in heapgetpage (sscan=sscan@entry=0x1c7cd18, page=page@entry=134785) at heapam.c:402
#8  0x000000000054a1f6 in heapgettup_pagemode (scan=scan@entry=0x1c7cd18, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1120
#9  0x000000000054b162 in heap_getnextslot (sscan=0x1c7cd18, direction=<optimized out>, slot=0x1c73270) at heapam.c:1352
#10 0x00000000006cff6b in table_scan_getnextslot (slot=0x1c73270, direction=ForwardScanDirection, sscan=<optimized out>) at ../../../src/include/access/tableam.h:1046
#11 SeqNext (node=node@entry=0x1c730d0) at nodeSeqscan.c:80
#12 0x00000000006a9c52 in ExecScanFetch (recheckMtd=0x6cfeb0 <SeqRecheck>, accessMtd=0x6cfee0 <SeqNext>, node=0x1c730d0) at execScan.c:132
#13 ExecScan (node=0x1c730d0, accessMtd=0x6cfee0 <SeqNext>, recheckMtd=0x6cfeb0 <SeqRecheck>) at execScan.c:198
#14 0x00000000006ba8fa in ExecProcNode (node=0x1c730d0) at ../../../src/include/executor/executor.h:262
#15 gather_getnext (gatherstate=0x1c72ee0) at nodeGather.c:295
#16 ExecGather (pstate=0x1c72ee0) at nodeGather.c:227

Relevant lines are highlighted. The BufferAlloc function is responsible for seeking a free buffer. Once found, it will call LWLockAcquire which attempts to acquire the lock on the buffer. The remaining three functions (0, 1, and 2) suggest that the process is likely using a semaphore (in addition, some futex-related operations are also present) when it fails to acquire the lock.

What cannot be seen from here is whether the process sleeps or spins, and if it sleeps, then for how long. Tracing system calls can help us here. Their frequency and inputs can help us discern the nature of the waiting. Let’s strace it and see:

pread64(6, "\6\0\0\0(\257\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62840832) = 8192
pread64(6, "\6\0\0\0\330\311\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62849024) = 8192
pread64(6, "\6\0\0\0h\343\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62857216) = 8192
pread64(6, "\6\0\0\0 \375\250\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62865408) = 8192
pread64(6, "\6\0\0\0\360\26\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62873600) = 8192
pread64(6, "\6\0\0\0\3000\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62881792) = 8192
pread64(6, "\6\0\0\0\220J\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62889984) = 8192
pread64(6, "\6\0\0\0`d\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62898176) = 8192
pread64(6, "\6\0\0\0\30~\251\340\0\0\4\0\f\1\200\1\0 \4 \0\0\0\0\200\237\362\0\0\237\362\0"..., 8192, 62906368) = 8192
futex(0x7fc549a67038, FUTEX_WAKE, 1)    = 1

This is a portion of the trace. The futex syscall is not that frequently there. Moreover, on lines 15 and 16 there are two of them with different inputs. The one on line 15 makes the calling thread to sleep until awakened. This is because the fourth parameter, which determines the timeout of the sleep, is NULL. The next futex call wakes up a waiting thread. The third input is 1, indicating that only one thread must be awakened.

We now know how long a thread sleeps when it fails to acquire the lock — until it is woken up. Additionally, we have determined that there is either no or very little spinning.

We now peek at the source code. The call stack above shows that the LWLockAcquire is there in the lwlock.c file (irrelevant lines have been replaced with dots):

LWLockAcquire(LWLock *lock, LWLockMode mode)
	for (;;)
		bool		mustwait;

		 * Try to grab the lock the first time, we're not in the waitqueue
		 * yet/anymore.
		mustwait = LWLockAttemptLock(lock, mode);

		if (!mustwait)
			LOG_LWDEBUG("LWLockAcquire", lock, "immediately acquired lock");
			break;				/* got the lock */
		/* add to the queue */
		LWLockQueueSelf(lock, mode);

		/* we're now guaranteed to be woken up if necessary */
		mustwait = LWLockAttemptLock(lock, mode);

		/* ok, grabbed the lock the second time round, need to undo queueing */
		if (!mustwait)
			LOG_LWDEBUG("LWLockAcquire", lock, "acquired, undoing queue");


		 * Wait until awakened.
		 * It is possible that we get awakened for a reason other than being
		 * signaled by LWLockRelease.  If so, loop back and wait again.  Once
		 * we've gotten the LWLock, re-increment the sema by the number of
		 * additional signals received.
		LOG_LWDEBUG("LWLockAcquire", lock, "waiting");



		for (;;)
			if (!proc->lwWaiting)

		/* Retrying, allow LWLockRelease to release waiters again. */
		pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_RELEASE_OK);

		LOG_LWDEBUG("LWLockAcquire", lock, "awakened");

		/* Now loop back and try to acquire lock again. */
		result = false;
	return result;

This is how it works. First, the process attempts to acquire the lock (line 13), and if unsuccessful, it queues itself up (line 21). The next attempt follows (line 24), and if successful, the process dequeues itself (line 31), allowing it to proceed when awakened. If it fails, it starts sleeping on a semaphore (line 55). Right before the sleep, the process registers a wait by calling LWLockReportWaitStart (line 49) and ends the wait by calling LWLockReportWaitEnd (line 64). The sleep ends when the process is awakened (line 57), and it retries to acquire the lock (line 62).

Therefore, the process of LWLock acquisition is more about sleeping than spinning. The duration of the wait is the time a process sleeps on a semaphore, excluding external waits. Just a few spins are negligible in comparison to this.

More drill-down in future post…

Deciphering PostgreSQL process swap space utilization

One of the questions that a colleague of mine recently asked was: “How can I find out how much swap space a PostgreSQL process is really using?”. This triggered a piece of curiosity which I will talk about in this post.

At first, it might seem like we can use the Linux top command which has memory related columns. Let’s run it for an arbitrary process and see the output:

[root@localhost ~]# top -p 3748

top - 00:42:16 up 3 days,  7:43,  5 users,  load average: 0.00, 0.01, 0.05
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.1 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16266588 total, 15392968 free,   233036 used,   640584 buff/cache
KiB Swap:  1679356 total,  1615172 free,    64184 used. 15566964 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND               SWAP
3748 postgres  20   0 8851352   4520   3344 S   0.0  0.0   0:00.01 postmaster             972

The top command reports that the process (pid 3748, line 9, last column) is utilizing 972 KB of swap space. There are three questions here:
1) Is this really swap and the OS does not fake it?
2) If it is, then how accurate is this number?
3) Why does the OS allocate swap to this process while there is free RAM?

Here, I will try to answer the first two questions (third one is for another post).

According to the Linux top manual the reported swap column can include any of the quadrants from #1 through #3 each of which includes memory pages allocated via certain functions. For example, quadrant #1 includes process stack and this does not actually tell anything about being swapped. It only tells us that memory pages allocated for process stack are eligible for being swapped in/out.

So, how can we find out then how much of the process address space is mapped and really is on the swap file?

To answer this question, let’s get back to some fundamentals.

A Linux process is provided with a virtual address range. A page from an address from this range can be allocated either from RAM or secondary storage (HDD, SSD, etc.). The process page table maintains mapping from a virtual address to a physical address. Moreover, it also holds control bits each of which tracks some information about the page, e.g., dirty bit in the page table indicates whether the page’s content is in sync with the disk, unevictable bit indicates that the page cannot be evicted from the LRU list, and so on. This information can be queried from /proc/kpageflags file which in the latest kernel holds 26 flags.

Two of the flags are related to swapping out of which swapcache is the one we need. This bit indicates whether the page table entry (PTE) has an associated swap entry.

Let’s now see how we can calculate the total swap space occupied by a process.

The environment for experiments is PostgreSQL 14 running on CentOS 7.7. Let’s first connect to the database and retrieve it’s OS PID to have a process to work with:

-bash-4.2$ psql postgres
psql (14.2)
Type "help" for help.

postgres=# select pg_backend_pid();
(1 row)

The next step is to get swapcache bit value for every page in every PTE for this process. This information is available in /proc/$PID/pagemaps from where we can find out which physical frame a virtual page is mapped to (the process page table) and flag values (this file must be parsed to be human readable; I used the parser available here). For the sake of demonstration, I will show just a single range of virtual addresses which can be obtained from /proc/$PID/maps.

The file holds all virtual address ranges for a process including libraries associated with each range, permissions and etc:

[root@localhost ~]# tail /proc/16460/maps
7f5a72e70000-7f5a72e77000 rw-s 00000000 00:13 246403                     /dev/shm/PostgreSQL.1005331588
7f5a72e77000-7f5a72e91000 rw-p 00000000 00:00 0
7f5a72e96000-7f5a72e97000 rw-s 00000000 00:04 327680                     /SYSV01008d37 (deleted)
7f5a72e97000-7f5a72e98000 rw-p 00000000 00:00 0
7f5a72e98000-7f5a72e99000 r--p 00021000 fd:00 35678                      /usr/lib64/ld-2.17.so;623f47b9 (deleted)
7f5a72e99000-7f5a72e9a000 rw-p 00022000 fd:00 35678                      /usr/lib64/ld-2.17.so;623f47b9 (deleted)
7f5a72e9a000-7f5a72e9b000 rw-p 00000000 00:00 0
7fff241ac000-7fff241cd000 rw-p 00000000 00:00 0                          [stack]
7fff241de000-7fff241e0000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Let’s take the range associated with the stack space (line 9) 7fff241ac000-7fff241cd000 and pass it to pagemap executable:

[root@localhost pagemap-master]# ./pagemap 16460 0x7f585720f000 0x7f585721b000
0x7f585720f000     : pfn 3e2942           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857210000     : pfn 3e2943           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857211000     : pfn 41c95c           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857212000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857213000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857214000     : pfn 41c95f           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857215000     : pfn 3ff0e8           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857216000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857217000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857218000     : pfn 3ff0eb           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857219000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f585721a000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0

As an example, consider line 1 of the output. The first value is the virtual address 0x7f585720f000 followed by the pfn: 3e2942 (page frame number – the actual physical address of the frame in hexadecimal), followed by attribute names and their values. For this address we see that swapped bit is 0. This is what we were looking for. Thus, for any page from the process page table, we can get the bit value which tells us whether the page is associated with a swap entry.

However, just the fact that the page is mapped to a swap entry does not yet mean that it has been been swapped out. We also have to make sure that the associated page in the swap space is been used. This can be obtained by looking at the locked bit in /proc/kpageflags which is set when a page has exclusively been locked by the process (e.g., when doing I/O). As seen from the last output, the pagemap parser does not report this bit but this can be added. Moreover, the parser has also another executable (pagemap2) which prints all pages of a process.

To sum up, the actual swap space used by a process can be calculated as the number of pages with swapped and locked bits both set times the page size. I will not pretend that I have written a script that calculates it for a running process but making one seems to be a good idea that can help troubleshooting in certain memory-related scenarios.

Note: Although the motivation originated from a PostgreSQL database related problem, the approach should be valid for an arbitrary user process.