Six Ways to Force “Sequential Scan” in PostgreSQL

Here are six ways to force a query to use the sequential scan access method in PostgreSQL. The objective is to eliminate any index access in the query. The test table looks as follows.

postgres=# create table tablescandemo as select generate_series(1,100000) id;
SELECT 100000
postgres=# create index inddemo on tablescandemo(id);
CREATE INDEX

Let’s get started.
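Before forcing anything, it is worth confirming that the planner indeed prefers an index-based plan for an equality predicate on the indexed column. A quick baseline check (the exact plan shape may vary with statistics and PostgreSQL version):

-- with everything at defaults, an equality predicate on the indexed column
-- should produce some flavor of index-based plan
explain select * from tablescandemo where id = 100;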

Option I. Using enable_indexscan, enable_indexonlyscan and enable_bitmapscan parameters.
Setting the enable_indexscan parameter to off disables the plain index scan access path. Let’s try it out.

postgres=# set enable_indexscan = off;
SET
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tablescandemo  (cost=12.17..482.50 rows=500 width=4) (actual time=0.039..0.041 rows=1 loops=1)
   Recheck Cond: (id = 100)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on inddemo  (cost=0.00..12.04 rows=500 width=0) (actual time=0.026..0.027 rows=1 loops=1)
         Index Cond: (id = 100)
 Planning Time: 0.568 ms
 Execution Time: 0.091 ms
(7 rows)

The planner still accesses the index, but this time it behaves differently. The index is traversed just as in a regular index scan; however, instead of producing a list of row locations in the heap, the database builds a bitmap: one bit per heap page. This is how the bitmap index scan works. The heap is then read in physical page order, and the bitmap decides whether a particular page must be visited at all. The enable_bitmapscan parameter controls the bitmap index scan access method:

postgres=# set enable_bitmapscan = off;
SET
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using inddemo on tablescandemo  (cost=0.29..4.31 rows=1 width=4) (actual time=0.070..0.071 rows=1 loops=1)
   Index Cond: (id = 100)
   Heap Fetches: 0
 Planning Time: 0.086 ms
 Execution Time: 0.094 ms
(5 rows)

Now, an index only scan kicks in, since the only column of the table is indexed. Let’s turn it off as well:

postgres=# set enable_indexonlyscan = off;
SET
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=1 width=4) (actual time=0.021..9.114 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.065 ms
 Execution Time: 9.137 ms
(5 rows)

Basically, one needs to make sure that not just one but all index access methods (index scan, index only scan and bitmap index scan) are off.
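Putting it all together, a minimal session-level sketch of Option I, including how to restore the defaults afterwards:

set enable_indexscan = off;
set enable_indexonlyscan = off;
set enable_bitmapscan = off;

-- run the query you want forced into a sequential scan here, then:

reset enable_indexscan;
reset enable_indexonlyscan;
reset enable_bitmapscan;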

Option II. Using random_page_cost parameter.
This parameter tells the optimizer the cost of a randomly accessed disk page. This concerns both index and table pages. Therefore, the higher this value, the more expensive an index access path looks to the optimizer, especially if the index is not very selective. Let’s see the experiment results (the index scan, index only scan and bitmap index scan control parameters are all back on now):

postgres=# set random_page_cost = 100000;
SET
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=1 width=4) (actual time=0.034..10.175 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.071 ms
 Execution Time: 10.200 ms
(5 rows)

The optimizer now estimates that retrieving the tuple(s) via the index will cost more than a sequential scan and hence prefers the latter.
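If the elevated cost should affect only a single query rather than the whole session, SET LOCAL keeps the change scoped to one transaction; a small sketch:

begin;
set local random_page_cost = 100000;
explain analyze select * from tablescandemo where id = 100;
commit;  -- random_page_cost reverts to its previous value here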

Option III. Altering table statistics.
At the end of the day, the optimizer uses database statistics, which one can actually fake. Selectivity of the predicate is the primary factor in deciding which access path to take. It is derived from the cardinality estimate, i.e. the number of tuples expected to qualify the predicate. For an equality predicate this is (assuming no extended statistics exist) nothing but N/NDV, where N is the number of tuples in the table and NDV is the number of distinct values of the column. The higher the NDV, the more selective the predicate is, and thus the higher the probability that the optimizer leans towards index access rather than a sequential scan. Conversely, if NDV is low, the optimizer will think that most of the table tuples qualify; index access will then cost much more than a sequential scan, and the latter will be preferred. Let’s check it out.
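The two inputs of that N/NDV formula can be inspected directly: N (reltuples) lives in pg_class, and NDV (n_distinct) shows up in pg_stats once the table has been analyzed. A quick look before faking anything:

select reltuples from pg_class where relname = 'tablescandemo';
select n_distinct from pg_stats where tablename = 'tablescandemo' and attname = 'id';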

postgres=# alter table tablescandemo alter column id set (n_distinct = 1);
ALTER TABLE
postgres=# analyze tablescandemo;
ANALYZE
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=100000 width=4) (actual time=0.015..6.366 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.098 ms
 Execution Time: 6.379 ms
(5 rows)
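To revert the fake, the per-column override can be reset and the table re-analyzed:

alter table tablescandemo alter column id reset (n_distinct);
analyze tablescandemo;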

Option IV. ‘Hacking’ table statistics.
Even though the previous option is a valid one, it has two drawbacks: 1) the ALTER TABLE command needs to lock the table and 2) ANALYZE may take a considerable amount of time depending on the size of the table. If that is a concern, you can update the statistics directly in the pg_statistic table. Surely, this is not a clean way, but it still works and in extreme cases may come in handy:

postgres=# update pg_statistic set stadistinct = 1 where starelid = (select oid from pg_class where relname = 'tablescandemo');
UPDATE 1
postgres=# explain analyze select * from tablescandemo where id = 6;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=100000 width=4) (actual time=0.008..6.612 rows=1 loops=1)
   Filter: (id = 6)
   Rows Removed by Filter: 99999
 Planning Time: 0.115 ms
 Execution Time: 6.625 ms
(5 rows)

postgres=#
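Keep in mind that the next ANALYZE (manual or triggered by autovacuum) recomputes the pg_statistic row and silently undoes the hack, which also gives you a simple way to revert it deliberately:

analyze tablescandemo;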

Option V. ‘Hacking’ table statistics (Part II).
Another statistic you can fake is the number of tuples the table is reported to contain (reltuples). The lower it is, the higher the probability that a sequential scan kicks in. This information is kept in the pg_class table and can be manipulated just like any other column value:

postgres=# update pg_class set reltuples=1 where relname = 'tablescandemo';
UPDATE 1
postgres=# explain analyze select * from tablescandemo where id = 6;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Seq Scan on tablescandemo  (cost=0.00..443.01 rows=1 width=4) (actual time=0.010..6.496 rows=1 loops=1)
   Filter: (id = 6)
   Rows Removed by Filter: 99999
 Planning Time: 0.138 ms
 Execution Time: 6.511 ms
(5 rows)
postgres=#
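To verify what the planner currently believes (and to see the fake take effect), the relevant pg_class columns can be queried directly; note that the next VACUUM or ANALYZE will overwrite both values with real ones:

select relname, relpages, reltuples from pg_class where relname = 'tablescandemo';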

Option VI. Using hints.
PostgreSQL does not support hints out of the box. However, you can use the pg_hint_plan extension to hint queries. It provides hints for scan methods, join methods, join order, parallelism, row count correction and a few others.
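For illustration, a minimal sketch of what a scan-method hint looks like with pg_hint_plan, assuming the extension is installed and loaded in the session:

load 'pg_hint_plan';
/*+ SeqScan(tablescandemo) */
explain select * from tablescandemo where id = 100;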

Ohh…yes…and option VII: drop all indexes :)).

Deciphering PostgreSQL process swap space utilization

One of the questions that a colleague of mine recently asked was: “How can I find out how much swap space a PostgreSQL process is really using?”. This triggered a bit of curiosity, which I will talk about in this post.

At first, it might seem like we can use the Linux top command, which has memory-related columns. Let’s run it for an arbitrary process and see the output:

[root@localhost ~]# top -p 3748

top - 00:42:16 up 3 days,  7:43,  5 users,  load average: 0.00, 0.01, 0.05
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.1 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16266588 total, 15392968 free,   233036 used,   640584 buff/cache
KiB Swap:  1679356 total,  1615172 free,    64184 used. 15566964 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND               SWAP
3748 postgres  20   0 8851352   4520   3344 S   0.0  0.0   0:00.01 postmaster             972

The top command reports that the process (PID 3748, last column of the process row) is using 972 KiB of swap space. There are three questions here:
1) Is this really swap and the OS does not fake it?
2) If it is, then how accurate is this number?
3) Why does the OS allocate swap to this process while there is free RAM?

Here, I will try to answer the first two questions (the third one is for another post).

According to the Linux top manual, the reported SWAP column can include any of the quadrants #1 through #3, each of which covers memory pages allocated via certain functions. For example, quadrant #1 includes the process stack, and this does not actually tell us anything about the pages being swapped out. It only tells us that the memory pages allocated for the process stack are eligible for being swapped in/out.

So, how can we then find out how much of the process address space is mapped and really resides in the swap area?

To answer this question, let’s get back to some fundamentals.

A Linux process is provided with a virtual address range. A page from this range can reside either in RAM or on secondary storage (HDD, SSD, etc.). The process page table maintains the mapping from a virtual address to a physical address. Moreover, the kernel also keeps control bits, each of which tracks some information about a page: e.g., the dirty bit indicates whether the page’s content is in sync with the disk, the unevictable bit indicates that the page cannot be evicted from the LRU list, and so on. This information can be queried from the /proc/kpageflags file, which in recent kernels holds 26 flags.

Two of the flags are related to swapping, and swapcache is the one we need. This bit indicates whether the page table entry (PTE) has an associated swap entry.

Let’s now see how we can calculate the total swap space occupied by a process.

The environment for the experiments is PostgreSQL 14 running on CentOS 7.7. Let’s first connect to the database and retrieve its OS PID to have a process to work with:

-bash-4.2$ psql postgres
psql (14.2)
Type "help" for help.

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          16460
(1 row)
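If the process of interest is some other session’s backend rather than your own, its PID can also be looked up from pg_stat_activity:

select pid, usename, state, query from pg_stat_activity where backend_type = 'client backend';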

The next step is to get the swapcache bit value for every page in every PTE of this process. This information is available in /proc/$PID/pagemap, from which we can find out which physical frame a virtual page is mapped to (the process page table) and the flag values (this file must be parsed to be human-readable; I used the parser available here). For the sake of demonstration, I will show just a single range of virtual addresses, which can be obtained from /proc/$PID/maps.

The file holds all virtual address ranges of a process, including the libraries associated with each range, permissions and so on:

[root@localhost ~]# tail /proc/16460/maps
7f5a72e70000-7f5a72e77000 rw-s 00000000 00:13 246403                     /dev/shm/PostgreSQL.1005331588
7f5a72e77000-7f5a72e91000 rw-p 00000000 00:00 0
7f5a72e96000-7f5a72e97000 rw-s 00000000 00:04 327680                     /SYSV01008d37 (deleted)
7f5a72e97000-7f5a72e98000 rw-p 00000000 00:00 0
7f5a72e98000-7f5a72e99000 r--p 00021000 fd:00 35678                      /usr/lib64/ld-2.17.so;623f47b9 (deleted)
7f5a72e99000-7f5a72e9a000 rw-p 00022000 fd:00 35678                      /usr/lib64/ld-2.17.so;623f47b9 (deleted)
7f5a72e9a000-7f5a72e9b000 rw-p 00000000 00:00 0
7fff241ac000-7fff241cd000 rw-p 00000000 00:00 0                          [stack]
7fff241de000-7fff241e0000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Let’s take the range associated with the stack space, 7fff241ac000-7fff241cd000, and pass it to the pagemap executable:

[root@localhost pagemap-master]# ./pagemap 16460 0x7f585720f000 0x7f585721b000
0x7f585720f000     : pfn 3e2942           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857210000     : pfn 3e2943           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857211000     : pfn 41c95c           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857212000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857213000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857214000     : pfn 41c95f           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857215000     : pfn 3ff0e8           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857216000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857217000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857218000     : pfn 3ff0eb           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857219000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f585721a000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0

As an example, consider the first line of the output. The first value is the virtual address 0x7f585720f000, followed by the pfn 3e2942 (page frame number, i.e. the physical frame the page is mapped to, in hexadecimal), followed by the attribute names and their values. For this address we see that the swapped bit is 0. This is what we were looking for: for any page from the process page table, we can get the bit value which tells us whether the page is associated with a swap entry.

However, just the fact that a page is mapped to a swap entry does not yet mean that it has been swapped out. We also have to make sure that the associated page in the swap space is actually in use. This can be obtained by looking at the locked bit in /proc/kpageflags, which is set when a page has been exclusively locked by the process (e.g., when doing I/O). As seen from the last output, the pagemap parser does not report this bit, but it can be added. Moreover, the parser also ships another executable (pagemap2) which prints all pages of a process.

To sum up, the actual swap space used by a process can be calculated as the number of pages with both the swapped and locked bits set, multiplied by the page size. I will not pretend that I have written a script that calculates this for a running process, but writing one seems to be a good idea that could help with troubleshooting in certain memory-related scenarios.

Note: Although the motivation originated from a PostgreSQL database related problem, the approach should be valid for an arbitrary user process.

FILLFACTOR parameter in PostgreSQL (Part II – source code hacking)

In Part I of this series I investigated the effect of the fillfactor parameter in Postgres 14.1. It turned out that the database does not always store the updated version of a tuple on a new page, whereas it should, at least with the default value of fillfactor=100.

Now, I will dig into the source code to see what prevents the database from doing so.

This time I used a table with fewer columns to rule out The Oversized-Attribute Storage Technique (TOAST). This is how PostgreSQL manages to store large field values, since it does not support spanning a tuple across multiple pages.

create table fillfactortable(
      col1 varchar(30),col2 varchar(30),col3 varchar(30),col4 varchar(30),col5 varchar(30),col6 varchar(30)
      );

insert into fillfactortable (
    col1, col2, col3,col4,col5,col6
)
select	
    left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10)
from generate_series(1, 1) s(i);

The size of the table is 8 kB and fillfactor has its default value of 100.

select pg_size_pretty(pg_total_relation_size('fillfactortable'));
 pg_size_pretty
----------------
 8192 bytes
(1 row)

To find out how a tuple is updated, let’s take a look at the source code. The routine that does it starts at line 3155 in src/backend/access/heap/heapam.c:

TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
			CommandId cid, Snapshot crosscheck, bool wait,
			TM_FailureData *tmfd, LockTupleMode *lockmode)
...............................................................................

Here, the following function is called to do the actual update (line 3926).

RelationPutHeapTuple(relation, newbuf, heaptup, false);

The newbuf parameter’s value decides whether the updated version of the tuple will be on the same page or not. This decision is based on the following condition (line 3670):

pagefree = PageGetHeapFreeSpace(page);
newtupsize = MAXALIGN(newtup->t_len);
if (need_toast || newtupsize > pagefree)
{
..........................................
}
else
	{
		/* No TOAST work needed, and it'll fit on same page */
		newbuf = buffer;
		heaptup = newtup;
	}

In my scenario need_toast is false, since the table has just six attributes of varchar(30). Moreover, had another page been needed for TOAST, the pg_total_relation_size function would have returned 16 kB, since an additional page would have been allocated. This can also be confirmed by querying pg_class for the corresponding oid. That is, newtupsize > pagefree evaluating to false is the only case in which the new version of the tuple is placed on the same page, since it is in that case that newbuf = buffer. This is exactly what happens here, because pagefree receives a value greater than newtupsize.
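As a side check, the fact that both tuple versions end up on block 0 can be observed with the pageinspect extension (also used below for page_header); a quick look at the line pointers after re-running the update from Part I:

create extension if not exists pageinspect;
select lp, lp_len, t_xmin, t_xmax, t_ctid
from heap_page_items(get_raw_page('fillfactortable', 0));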

To investigate further, let’s look at PageGetHeapFreeSpace (around line 899 in src/backend/storage/page/bufpage.c). It delegates the basic free-space calculation to PageGetFreeSpace:

Size
PageGetFreeSpace(Page page)
{
	int			space;

	/*
	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
	 * pd_upper.
	 */
	space = (int) ((PageHeader) page)->pd_upper -
		(int) ((PageHeader) page)->pd_lower;

	if (space < (int) sizeof(ItemIdData))
		return 0;
	space -= sizeof(ItemIdData);

	return (Size) space;
}

Here, pd_lower and pd_upper are the offsets of the start and the end of the free space in a page, respectively. By computing pd_upper - pd_lower the routine obtains the space available between the end of the line pointers and the start of the tuple data. The check if (space < (int) sizeof(ItemIdData)) (line 13 in the last block) then makes sure there is room for at least one more line pointer, whose size is subtracted from the free space.

Let’s see what the values of pd_lower and pd_upper are for our table (Heads up! No fillfactor involvement so far!):

SELECT lower, upper FROM page_header(get_raw_page('fillfactortable', 0));
 lower | upper
-------+-------
    28 |  8096
(1 row)

That is, the page has 8096 - 28 = 8068 bytes free. From this, the space taken by the single line pointer (4 bytes) has to be subtracted as well, which gives 8064 for the pagefree variable, and that is greater than the tuple size, which is itself only a few tens of bytes.
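For completeness, the size of the row itself can be approximated with pg_column_size on a whole-row reference, which should confirm it is only a few tens of bytes:

select pg_column_size(t.*) from fillfactortable t;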

Summary:
A table with fillfactor=100 places the new version of an updated tuple on the same page, which should not be the case according to the documentation. The tuple update routine uses the PageGetFreeSpace function (via PageGetHeapFreeSpace) to decide whether the new version of the updated tuple stays on the same page. Neither the caller nor the routine itself makes use of the fillfactor value. The routine calculates free space based on pd_upper, pd_lower and the line pointers. In this scenario, pagefree should have been zero, in which case the database would have allocated an additional page and placed the new version of the tuple there. As demonstrated, this did not happen even though fillfactor was set to 100. In Part III of this series I will investigate the cases when it does occur.

FILLFACTOR parameter in PostgreSQL (Part I)

The purpose of FILLFACTOR in PostgreSQL is widely known. According to the documentation, the default value of 100 means complete packing; that is, the new version of a tuple will be placed on a different page. Here is a simple experiment demonstrating that this is not always the case.
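For reference, a non-default fillfactor can be specified when a table is created or changed later; a hypothetical snippet (the table name and the value 70 are arbitrary illustrations):

-- at creation time
create table fillfactordemo (col1 varchar(30)) with (fillfactor = 70);
-- or later (affects only newly written pages, existing pages are not rewritten)
alter table fillfactordemo set (fillfactor = 70);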

Here is the experimental table:

create table fillfactortable(
      col1 varchar(30),col2 varchar(30),col3 varchar(30),col4 varchar(30),col5 varchar(30),col6 varchar(30),
      col7 varchar(30),col8 varchar(30),col9 varchar(30),col10 varchar(30),col11 varchar(30),col12 varchar(30),
      col13 varchar(30),col14 varchar(30),col15 varchar(30),col16 varchar(30),col17 varchar(30),col18 varchar(30),
      col19 varchar(30),col20 varchar(30)
      ) ;

As seen, FILLFACTOR is not explicitly specified, i.e. the default value of 100 is used. Let’s insert a single row

insert into fillfactortable (
    col1, col2, col3,col4,col5,col6, col7,col8,col9,col10,col11,col12,
    col13,col14,col15,col16,col17,col18, col19,col20
)
select	
    left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10)
from generate_series(1, 1) s(i);

followed by checking the number of pages in the table.

select pg_size_pretty (pg_relation_size('fillfactortable'));
 pg_size_pretty
----------------
 8192 bytes
(1 row)

The table has only one page. Now, let’s update the tuple.

update fillfactortable set col1 = 'random';
UPDATE 1

This results in a dead tuple which can be confirmed via the following query:

select n_live_tup, n_dead_tup, relname from pg_stat_all_tables where relname = 'fillfactortable';
 n_live_tup | n_dead_tup |     relname
------------+------------+-----------------
          1 |          1 | fillfactortable
(1 row)

Now we can check the table size again:

select pg_size_pretty (pg_relation_size('fillfactortable'));
 pg_size_pretty
----------------
 8192 bytes
(1 row)

That is, the table still has only one page, i.e. no new page has been allocated to store the other version of the updated tuple, which should not have happened according to the documentation. In Part II of this topic, I will explore why this happens.
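As a final sanity check, one more way to confirm that the new version stayed on block 0 is to look at the tuple’s ctid; the first component of the pair is the block number, so it should still read 0 here:

select ctid, col1 from fillfactortable;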