Deciphering PostgreSQL process swap space utilization

One of the questions that a colleague of mine recently asked was: “How can I find out how much swap space a PostgreSQL process is really using?”. This triggered a piece of curiosity which I will talk about in this post.

At first, it might seem like we can use the Linux top command which has memory related columns. Let’s run it for an arbitrary process and see the output:

[root@localhost ~]# top -p 3748

top - 00:42:16 up 3 days,  7:43,  5 users,  load average: 0.00, 0.01, 0.05
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.1 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16266588 total, 15392968 free,   233036 used,   640584 buff/cache
KiB Swap:  1679356 total,  1615172 free,    64184 used. 15566964 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND               SWAP
3748 postgres  20   0 8851352   4520   3344 S   0.0  0.0   0:00.01 postmaster             972

The top command reports that the process (pid 3748, line 9, last column) is utilizing 972 KB of swap space. There are three questions here:
1) Is this really swap and the OS does not fake it?
2) If it is, then how accurate is this number?
3) Why does the OS allocate swap to this process while there is free RAM?

Here, I will try to answer the first two questions (third one is for another post).

According to the Linux top manual the reported swap column can include any of the quadrants from #1 through #3 each of which includes memory pages allocated via certain functions. For example, quadrant #1 includes process stack and this does not actually tell anything about being swapped. It only tells us that memory pages allocated for process stack are eligible for being swapped in/out.

So, how can we find out then how much of the process address space is mapped and really is on the swap file?

To answer this question, let’s get back to some fundamentals.

A Linux process is provided with a virtual address range. A page from an address from this range can be allocated either from RAM or secondary storage (HDD, SSD, etc.). The process page table maintains mapping from a virtual address to a physical address. Moreover, it also holds control bits each of which tracks some information about the page, e.g., dirty bit in the page table indicates whether the page’s content is in sync with the disk, unevictable bit indicates that the page cannot be evicted from the LRU list, and so on. This information can be queried from /proc/kpageflags file which in the latest kernel holds 26 flags.

Two of the flags are related to swapping out of which swapcache is the one we need. This bit indicates whether the page table entry (PTE) has an associated swap entry.

Let’s now see how we can calculate the total swap space occupied by a process.

The environment for experiments is PostgreSQL 14 running on CentOS 7.7. Let’s first connect to the database and retrieve it’s OS PID to have a process to work with:

-bash-4.2$ psql postgres
psql (14.2)
Type "help" for help.

postgres=# select pg_backend_pid();
(1 row)

The next step is to get swapcache bit value for every page in every PTE for this process. This information is available in /proc/$PID/pagemaps from where we can find out which physical frame a virtual page is mapped to (the process page table) and flag values (this file must be parsed to be human readable; I used the parser available here). For the sake of demonstration, I will show just a single range of virtual addresses which can be obtained from /proc/$PID/maps.

The file holds all virtual address ranges for a process including libraries associated with each range, permissions and etc:

[root@localhost ~]# tail /proc/16460/maps
7f5a72e70000-7f5a72e77000 rw-s 00000000 00:13 246403                     /dev/shm/PostgreSQL.1005331588
7f5a72e77000-7f5a72e91000 rw-p 00000000 00:00 0
7f5a72e96000-7f5a72e97000 rw-s 00000000 00:04 327680                     /SYSV01008d37 (deleted)
7f5a72e97000-7f5a72e98000 rw-p 00000000 00:00 0
7f5a72e98000-7f5a72e99000 r--p 00021000 fd:00 35678                      /usr/lib64/;623f47b9 (deleted)
7f5a72e99000-7f5a72e9a000 rw-p 00022000 fd:00 35678                      /usr/lib64/;623f47b9 (deleted)
7f5a72e9a000-7f5a72e9b000 rw-p 00000000 00:00 0
7fff241ac000-7fff241cd000 rw-p 00000000 00:00 0                          [stack]
7fff241de000-7fff241e0000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Let’s take the range associated with the stack space (line 9) 7fff241ac000-7fff241cd000 and pass it to pagemap executable:

[root@localhost pagemap-master]# ./pagemap 16460 0x7f585720f000 0x7f585721b000
0x7f585720f000     : pfn 3e2942           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857210000     : pfn 3e2943           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857211000     : pfn 41c95c           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857212000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857213000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857214000     : pfn 41c95f           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857215000     : pfn 3ff0e8           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857216000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857217000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857218000     : pfn 3ff0eb           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857219000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f585721a000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0

As an example, consider line 1 of the output. The first value is the virtual address 0x7f585720f000 followed by the pfn: 3e2942 (page frame number – the actual physical address of the frame in hexadecimal), followed by attribute names and their values. For this address we see that swapped bit is 0. This is what we were looking for. Thus, for any page from the process page table, we can get the bit value which tells us whether the page is associated with a swap entry.

However, just the fact that the page is mapped to a swap entry does not yet mean that it has been been swapped out. We also have to make sure that the associated page in the swap space is been used. This can be obtained by looking at the locked bit in /proc/kpageflags which is set when a page has exclusively been locked by the process (e.g., when doing I/O). As seen from the last output, the pagemap parser does not report this bit but this can be added. Moreover, the parser has also another executable (pagemap2) which prints all pages of a process.

To sum up, the actual swap space used by a process can be calculated as the number of pages with swapped and locked bits both set times the page size. I will not pretend that I have written a script that calculates it for a running process but making one seems to be a good idea that can help troubleshooting in certain memory-related scenarios.

Note: Although the motivation originated from a PostgreSQL database related problem, the approach should be valid for an arbitrary user process.

“Latch: row cache objects” wait event is gone in Oracle 21c

Oracle 21c seems to be full of little surprises. Initially appeared in the first release of Oracle 10g and one of the most frequently dealt with “Latch: row cache objects” wait event has disappeared in Oracle 21c. Recall that this latch protects row cache (aka data dictionary cache) entries from simultaneous access by several processes and is acquired before a session can modify row cache objects. Oracle 21c no longer has it:

SQL> select name from v$event_name where name like '%latch: row%';

no rows selected


An obvious question is: if this latch is gone, then what replaced it?

To answer this question (or at least to make reasonable assumptions that will take us closer to it) let’s examine other wait events that serve similar purpose.

Firstly, recall that the row cache is made of several caches.

select cache#, parameter from v$rowcache;
3	dc_rollback_segments
1	dc_free_extents
4	dc_used_extents
2	dc_segments
0	dc_tablespaces
5	dc_tablespace_quotas
6	dc_files
10	dc_users
8	dc_objects
17	dc_global_oids
12	dc_constraints
13	dc_sequences
16	dc_histogram_defs
54	dc_sql_prs_errors
32	kqlsubheap_object

For example, the dc_sequences cache (line 14) store sequences and associated metadata. Each cache is protected by a child latch.

Secondly, it would not make much sense to replace one latch by another. From the other hand, the cache still has to be protected by something. Therefore, it has to be something that is more lightweight than latch. This is mutex (less lightweight is a pin).

Let’s see what mutex related wait events are there in Oracle 21c.

select name from v$event_name where name like '%mutex%'
SecureFile fso mutex
SecureFile htbm mutex
SecureFile lhtb mutex
row cache mutex
cursor: mutex X
cursor: mutex S
library cache: mutex X
library cache: bucket mutex X
library cache: dependency mutex X
library cache: mutex S

The row cache mutex wait event (line 6) seems very interesting in this context, doesn’t it? Apparently, this is the one that protects the row cache now.

But one question still remains: why Oracle 19c (and earlier versions as well which I haven’t checked) has both “latch: row cache objects” and “row cache mutex” wait events? An undocumented parameter could be there to dictate the protection mechanism which does not seem to be the case, as far as I’ve researched. Most probably, in this version the row cache is protected by both: partly by latches, partly by mutexes and the former is completely gone in 21c.

Result Cache Malfunction in Oracle 21c

One of these days I was configuring result cache in an Oracle 21c database and encountered an unusual behavior. Recall, that result cache which is a slice of memory in the shared pool was introduced in Oracle 11g. It stores results of queries and subsequent ones can reuse it.

Looks like it does not work as documented in Oracle 21c. Two modes are available to configure it: manual by hinting the query and force which will make all queries to store their results in the result cache which is, by the way, not the recommended way. I was experimenting on Ubuntu 8. Here is the scenario (note that result_cache_max_size parameter value which enables or disables result cache is 1792K).

First, let’s create a small table and fill it with some data.

create table table1(id integer, name varchar2(30))
insert into table1 values (1, 'Name1');
insert into table1 values (2, 'Name2');
insert into table1 values (3, 'Name3');

The query is as simple as the following one:

select /*+ RESULT_CACHE */ * from table1;

Plan hash value: 963482612

| Id  | Operation         | Name   | E-Rows |
|   0 | SELECT STATEMENT  |        |        |
|   1 |  TABLE ACCESS FULL| TABLE1 |      3 |

Just to make sure you that the query will use the previously cached results we can execute in multiple times. Still, no effect – plan remains the same.
Tracing the optimizer reveals some details (though not yet clear why so):

Dumping Hints
  atom_hint=(@=0x7f25d42835a8 err=38 resol=1 used=1 token=1192 org=1 lvl=2 txt=RESULT_CACHE ())
********** WARNING: SOME HINTS HAVE ERRORS *********
====================== END SQL Statement Dump ======================

That is, the optimizer considers RESULT_CACHE as a hint and errors out explicitly telling that it cannot parse it: err=38. I’ve tried the FULL hint to compare and for this one err=0. It is easy to see that it is not a configuration related problem, for example, minimum data size to be eligible for caching and etc. since the optimizer says that it faces an error. Moreover, just to exclude size related limitations, if any, I tried with a large data size – no effect.

To be continued…

FILLFACTOR parameter in PostgreSQL (Part II – source code hacking)

In Part I of this series I was investigating the effect of fillfactor parameter in Postgres 14.1. It turned out that the database does not always store the updated version of a tuple on a new page whereas it should, at least because of the default value of fillfactor=100.

Now, I will investigate the source code to see what prevents the database from doing it so.

This time I used a table with less number of columns to exclude the The Oversized-Attribute Storage Technique (TOAST). This is how PostgreSQL manages to store large field values since it does not support spanning a tuple across multiple pages.

create table fillfactortable(
      col1 varchar(30),col2 varchar(30),col3 varchar(30),col4 varchar(30),col5 varchar(30),col6 varchar(30)

insert into fillfactortable (
    col1, col2, col3,col4,col5,col6
    left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10)
from generate_series(1, 1) s(i);

Size of the table is 8KB and fillfactor has default value of 100.

select * from pg_total_relation_size('fillfactortable');
 8192 bytes
(1 row)

To find out how a tuple is updated let’s take a look at the source code. The routine that does it starts at line 3155 in \src\backend\access\heap\heapam.c:

heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
			CommandId cid, Snapshot crosscheck, bool wait,
			TM_FailureData *tmfd, LockTupleMode *lockmode)

Here, the following function is called to do the actual update (line 3926).

RelationPutHeapTuple(relation, newbuf, heaptup, false);

The newbuff parameter’s value decides whether the updated version of the tuple will be on the same page or not. This decision is based on the following condition (line 3670):

pagefree = PageGetHeapFreeSpace(page);
newtupsize = MAXALIGN(newtup->t_len);
if (need_toast || newtupsize > pagefree)
		/* No TOAST work needed, and it'll fit on same page */
		newbuf = buffer;
		heaptup = newtup;

In my scenario need_toast is false since the table has just six attributes of varchar(30). Moreover, should there be another page for TOAST the pg_total_relation_size function would return 16KB since additional page would have been allocated. This can be confirmed also by querying pg_class with the corresponding value for oid column. That is, newtupsize > pagefree evaluating to false is the only case when the new version of the tuple is placed on the same page since it is that case newbuf = buffer. This is actually what happens because pagefree receives a value greater than newtupsize.

To investigate it further, let’s see how PageGetHeapFreeSpace works which can be found at line 899 in \src\backend\storage\page\bufpage.c:

PageGetFreeSpace(Page page)
	int			space;

	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
	 * pd_upper.
	space = (int) ((PageHeader) page)->pd_upper -
		(int) ((PageHeader) page)->pd_lower;

	if (space < (int) sizeof(ItemIdData))
		return 0;
	space -= sizeof(ItemIdData);

	return (Size) space;

Here, pd_upper and pd_lower is offset of end of free space and start of free space in a page relatively. By doing this subtraction pd_upperpd_lower it calculates available space between between the end of line pointers and start of tuple values. This condition then (line 13 in the last block) if (space < (int) sizeof(ItemIdData)) checks how much space is left ultimately after deducting space taken by the line pointers.

Let’s what the values are for pd_upper and pd_lower in the table (Heads up! No fillfactor involvement so far!):

SELECT lower, upper FROM page_header(get_raw_page('fillfactortable', 0));
lower       upper
--------   ----------
 28        8096
(1 row)

That is, the page has 8096-28=8068 bytes free. From this the space taken by the single line pointer (4 bytes) has to be subtracted as well which gives 8064 value for the pagefree parameter which is greater than the tuple size which is itself few tens of bytes in size.

A table with fillfactor=100 does not place the new version of the updated tuple on the same page what should not be the case according to the documentation. The tuple update routine makes use of PageGetFreeSpace function to decide whether the new version of the updated tuple resides on the same page. Neither the caller nor the routine itself make use fillfactor value. The routine calculates free spaced based on pd_upper and pd_lower and line pointers. In this scenario, pagefree should have been zero in which case the database would allocate additional page and place the new version of the tuple on this new page. As demonstrated, this did not happen even though fillfactor was set to 100. In Part III of this post series I will investigate cases when it does occur.

FILLFACTOR parameter in PostgreSQL (Part I)

The purpose of FILLFACTOR in PostgreSQL is widely known. According to the documentation, the default value of 100 means complete packing. That is, the new version of a tuple will be placed on a different page. Here is a simple experiment that demonstrates that this is not always the case.

Here is the experimental table:

create table fillfactortable(
      col1 varchar(30),col2 varchar(30),col3 varchar(30),col4 varchar(30),col5 varchar(30),col6 varchar(30),
      col7 varchar(30),col8 varchar(30),col9 varchar(30),col10 varchar(30),col11 varchar(30),col12 varchar(30),
      col13 varchar(30),col14 varchar(30),col15 varchar(30),col16 varchar(30),col17 varchar(30),col18 varchar(30),
      col19 varchar(30),col20 varchar(30)
      ) ;

As seen, the FILLFACTOR is not explicitly specified i.e. default value of is used. Let’s insert a single row

insert into fillfactortable (
    col1, col2, col3,col4,col5,col6, col7,col8,col9,col10,col11,col12,
    col13,col14,col15,col16,col17,col18, col19,col20
    left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10)
from generate_series(1, 1) s(i);

followed by checking the number of pages in the table.

select pg_size_pretty (pg_relation_size('fillfactortable'));
 8192 bytes
(1 row)

The table has only one page. Now, let’s update the tuple.

update fillfactortable set col1 = 'random';

This results in a dead tuple which can be confirmed via the following query:

select n_live_tup, n_dead_tup, relname from pg_stat_all_tables where relname = 'fillfactortable';
 n_live_tup | n_dead_tup |     relname
          1 |          1 | fillfactortable
(1 row)

Now we can check the table size again:

select pg_size_pretty (pg_relation_size('fillfactortable'));
 8192 bytes
(1 row)

That is, the table still has only one page i.e. no page has been allocated to store the other version of the updated tuple which should have not happened according to the documentation. In Part II of this topic, I will explore the cases when this happens.

On Node Addition in Apache Cassandra

The purpose of this post is not to reiterate but rather restate how simple is adding a node to a running Cassandra cluster. I will not pretend that this is the whole thing that needs to be done once the node has been added. For example, the cluster will start streaming data to the newly added node which is not always cheap and which must be taken care of once the node has joined the cluster. Nevertheless, the procedure itself is pretty straightforward and the following is the minimum one must do to add a node to a running Cassandra cluster.

Everything below runs on my laptop using Virtual Box VM’s. Here is a cluster of two nodes running on CentOS 7.9:

[centos@centos7 cassandra]$ nodetool status
Datacenter: Cassandra
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  292.99 KiB  1            100.0%            c9504a89-653b-4e9d-a289-3444fc608a3b  rack1
UN  218.35 KiB  1            100.0%            de35416a-256a-445b-ab8d-c6a03da29eee  rack1

The cluster is up and running. What we want to do now is to add the third node. Assuming the machine has been prepared the steps are:
1. Modify the cassandra.yaml file.
2. Start the instance.

That’s it! As simple as that. The minimum that must be changed in the cassandra.yaml file is the seeds and listen_address sections. Seed node is not a specific node. It can be any node(s) from the cluster which the joining node needs to contact to in order to communicate cluster information. Listen_address is the address that nodes use to communicate with each other.

The new node’s IP address is for which is specified as the seed.  Here is how the seeds section looks like after the change.

seed_provider: # Addresses of hosts that are deemed contact points. 
               # Database nodes use this list of hosts to find each other and learn 
               # the topology of the ring. You _must_ change this if you are running 
               # multiple nodes! - class_name: org.apache.cassandra.locator.SimpleSeedProvider parameters: 
               # seeds is actually a comma-delimited list of addresses. 
               # Ex: "<ip1>,<ip2>,<ip3>" 
               - seeds: ""

and here is how I changed the listen_address section

# Address or interface to bind to and tell other nodes to connect to.
# You _must_ change this address or interface to enable multiple nodes to communicate!
# Set listen_address OR listen_interface, not both.
# When not set (blank), InetAddress.getLocalHost() is used. This
# will always do the Right Thing _if_ the node is properly configured
# (hostname, name resolution, etc), and the Right Thing is to use the
# address associated with the hostname (it might not be).
# Setting listen_address to is always wrong.

Now the cassandra instance can be started after which it can be seen that the cluster is comprised of three nodes:

[centos@centos7 ~]$ nodetool status
Datacenter: Cassandra
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  306.67 KiB  1            35.7%             c9504a89-653b-4e9d-a289-3444fc608a3b  rack1
UN  281.72 KiB  1            85.4%             de35416a-256a-445b-ab8d-c6a03da29eee  rack1
UN  326 KiB    1            78.9%             88965b8c-1ba0-4d70-a034-616001795044  rack1

Node addition is completed.

In Favor of Temporary Tablespaces

Imagine a query that faces PGA shortage and heavily spills to temporary tablespace to complete (sorting, hashing and etc.). A typical solution is to tune the query, whereas the following can also act as a way out.
The environment is Oracle 19.3 on Linux 6.4. A test table with ~4 million rows is indexed on one of the columns with temporary tablespace datafiles residing on disk. Index creation requires sorting and the volume of data (4 million rows) forces disk spills. In such a case, 11904 temporary datafile blocks have been allocated and the query completes in 16 seconds.
Then, we load the temporary datafiles in /tmpfs on Linux which caches it in the RAM. Oracle is actually agnostic where the files reside: disk or memory. This time the index creation (started from scratch) spills to 11904 blocks just like before but the query runs two times faster. OS analysis also confirms that the blocks reside in memory this time.

The World’s Shortest Troubleshooting Scenario

The problem:  A few minute spike on control file sequential read event.
The cause:      A logon storm.
The proof:      A logon trigger querying v$database (that’s a controlfile access). Hence, the wait event.
The solution:  Either remove the query from the trigger or keep the query and find out the reason of excessive logons at that interval given that it will be eliminated on that side. The latter is the correct way to go.

Tunable PGA analogue in Hadoop

Assume your query needs to do sorting. If the sort memory size (derived from pga_aggregate_target) is not big enough to complete the whole sorting in memory, Oracle will spill to disk to it. The sorting process will go though multiple passes to sort the whole data.

PGA is a private memory area allocated exclusively to a process. Memory chunks in it are categorized into tunable and untunable. The latter is the memory area allocated for pl/sql collections, varchar types and etc. The former can be controlled by adjusting corresponding parameters. When pga_aggregate_target is set to a non-zero value, Oracle adjusts certain internal (undocumented) parameters which is then used for making decisions when allocating memory to certain operations. For example, the memory area used for hash join is one of those derived from it. Another one is memory used for sortingThe tunable part can be controlled.

Regarding Hadoop…MapReduce works in two phases. Map phase prepares data by converting it into a collection of key-value value pairs which are then consumed by reducers. This phase sorts data by key before passing it to reducers. Sorting may spill to disk, if the sort buffer is not enough to complete the sorting entirely in memory. So, in order to fasten the job itself one may want to consider increasing the (defaulting to 100 MB) parameter to give extra memory to do the sort. This is a potential tuning trick if you know that the slowness of the job is indeed related to the fact that the map phase spills multiple times to the disk to complete. That said, this parameter in Hadoop is nothing but _smm_max_size in Oracle.

More Hadoop tricks in the next posts…

Troubleshooting Slowness Associated with Log Activity (again briefly)

One of databases running on Oracle 12.1 was hanging for some minutes. The top wait event was log file sync. This wait event is posted whenever a user session issues COMMIT or ROLLBACK. A message is then sent to LGWR to flush logs associated with the transaction from the log buffer to redo log files. Redo storage slowness can cause high number of this wait event too. Ultimate blocker in either case is LGWR.
In most cases (but not always), high commit frequency is the cause and should be resolved (decreased) on the application side. That was not the case this time. The reason was the fact that a session (just a single one) issued lots DML statements which generated huge amount of redo. Once the session issued commit, it took quite a long for LGWR to flush redo from the log buffer to the disk. That stalled all other sessions because LGWR was busy and committing sessions were stuck as well. The transaction logic must be changed in the application to issue intermittent commits and not to commit only once when it ends.