Reasoning about OS Scheduler for Performance Degradation Analysis

It all started with a performance issue in one of our clusters. One of the components on the critical path turned out to be the OS scheduler, which I analyzed to find out whether it was the offender itself or just a symptom, with the global slowness caused by other factors in the stack. The whole story is very long and would take several pages to describe in full. That is why I will present here just a short excerpt from the whole diagnosis journey. So, don’t be surprised to notice that the story has (sort of) neither a beginning nor an end.

I decided to peek at the OS scheduler because vmstat reported high run queue lengths, and I concluded that it would very likely lead me to the ultimate offender. Scheduler statistics can be traced with the perf command, which has many options. Since the issue had been ongoing for several minutes, a few tens of seconds of tracing would suffice. Moreover, as I didn’t know which scheduler event to focus on, I traced all of them:

[fgenceliyev@machine1 ~]$ sudo perf sched record -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 50.183 MB (132,856 samples) ]

The next step was to check whether what I suspected was true and I was going in the right direction (the output is intentionally shortened):

[fgenceliyev@machine1 ~]$ sudo perf sched latency --sort max

  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
  MutationStage-1:(2)   |    240.607 ms |    13933 | avg:    0.083 ms | max:  416.618 ms | max at: 275770.668504 s
  C2 CompilerThre:(6)   |   9980.936 ms |     2010 | avg:    0.934 ms | max:  322.393 ms | max at: 275772.038103 s
  MutationStage-8:17873 |     50.710 ms |     1743 | avg:    0.248 ms | max:  310.998 ms | max at: 275771.587630 s
  MutationStage-5:17870 |     62.694 ms |     2213 | avg:    0.347 ms | max:  310.980 ms | max at: 275771.587612 s
  Thread-17:19000       |     49.637 ms |      783 | avg:    0.494 ms | max:  310.795 ms | max at: 275771.587998 s
  vmtoolsd:1972         |     17.737 ms |       73 | avg:    4.401 ms | max:  287.086 ms | max at: 275769.602360 s

No comments here: huge delays! There are two pieces of good news, though: a) the output has a timestamp that shows when the maximum delay happened for a particular process (or thread), and b) it doesn’t matter which process from the output we chase to drill down further, since (whatever it is) all of them suffer from the same problem.

At this point, we know the delay is there but we don’t know where that delay occurs. In other words, we now need to split the delay into its parts and see where exactly time is spent most (we can take an arbitrary process to measure).

Let’s take the very first timestamp (275770.668504) for analysis and see how the time is distributed. This can be achieved with the timehist option of perf, which gives a breakdown of the events.

[fgenceliyev@machine1 ~]$ sudo perf sched timehist -f
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  275767.383658 [0001]  cluster1-nio-wo[18972/18944]        0.000      0.000      0.000 
  275767.383663 [0002]  <idle>                              0.000      0.000      0.000 
  275767.383664 [0000]  <idle>                              0.000      0.000      0.000 
  275767.383670 [0002]  MutationStage-5[17870/17693]        0.000      0.000      0.006 
  275767.383670 [0000]  MutationStage-1[17803/17693]        0.000      0.000      0.006 
  275767.383696 [0003]  cluster1-nio-wo[18970/18944]        0.000      0.000      0.000 
  275767.383697 [0001]  Thread-6[18989/18944]               0.000      0.000      0.039 
  275767.383699 [0005]  <idle>                              0.000      0.000      0.000
  275770.668504 [0004]  cluster1-nio-wo[18971/18944]      416.371      0.277      0.254

We are interested in two columns: wait time and sch delay. The first is the time a task was waiting to be woken up, and the second is the time from the moment the context switch begins to the point the task is scheduled in. The latter is the scheduler latency, i.e. the time it takes the scheduler to change the task’s status to Runnable so that it can then be switched to Running.

What we can see on the last line of the output above is that the scheduler latency is very high: 0.277 msec. But what’s even worse is that the task had been waiting for 416 msec just to be picked up by the scheduler. Such a situation is a little tricky, as it resembles the ‘chicken and egg’ problem: either can cause the other. The scheduler’s slow decisions could stem from the same root cause that made the task wait so long and, conversely, high scheduler latencies can cause significant delays before a task receives a wake-up signal. Without evidence we cannot judge either to be the cause. And if we don’t have enough evidence (and sometimes we don’t :)) ), we should at least be able to say which is more and which is less likely.
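To put the two columns into perspective: the time from the task going off CPU to finishing its next run decomposes into wait time, scheduler delay, and run time. A back-of-the-envelope check with the numbers from the timehist line above (just a sketch; the variable names are mine):

```python
# Breakdown of the delayed timehist line for cluster1-nio-wo (all in msec).
wait_time = 416.371   # time the task waited before being woken up
sch_delay = 0.277     # scheduler latency: wakeup -> actually scheduled in
run_time  = 0.254     # time the task then spent on CPU

total = wait_time + sch_delay + run_time
print(f"total: {total:.3f} ms, wait share: {wait_time / total:.1%}")
```

The wait dominates by orders of magnitude, which is why the chicken-and-egg question above matters: the big number is the wait, not the scheduling decision itself.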

Since we don’t know which caused the other one we should attack each of them separately as if we don’t have the other issue at all. Thus, the first question is “What affects latency of scheduler decisions?”

Recall that the OS scheduler is just another kernel thread and to do its job it needs CPU time. If the CPU is saturated (for example, by doing heavy I/O or doing moderate I/O from a slow device) then the scheduler’s behavior can be affected and, hence, its latency may rise.

This is a three-node Apache Cassandra cluster. For each type of task (such as handling client requests, updating tables, etc.) the database maintains a pool of threads, each of which performs its job. We need to see what those threads are doing; apparently, that will tell us which actions burn the CPU (or which are slow). To see the breakdown by thread, we will use psn (Tanel Poder’s process snapper):

=== Active Threads =============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================

 samples | avg_threads | comm2             | state                  | wchan                        | kstack                                                                                                                                                                                                                                                                                                                                                                                                                     
      32 |        4.00 | (Native-Transpor) | Running (ON CPU)       | 0                            | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
      20 |        2.50 | (MutationStage-1) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
      19 |        2.38 | (Native-Transpor) | Running (ON CPU)       | futex_wait_queue_me          | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
      18 |        2.25 | (Native-Transpor) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
      16 |        2.00 | (java)            | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       8 |        1.00 | (MutationStage-1) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       7 |        0.88 | (Native-Transpor) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       6 |        0.75 | (Native-Transpor) | Running (ON CPU)       | 0                            | do_futex()->futex_wait()->hrtimer_cancel()->hrtimer_try_to_cancel()                                                                                                                                                                                                                                                                                                                                                        
       5 |        0.62 | (MutationStage-6) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       5 |        0.62 | (epollEventLoopG) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       4 |        0.50 | (MutationStage-9) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       4 |        0.50 | (Native-Transpor) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          
       4 |        0.50 | (java)            | Running (ON CPU)       | 0                            | do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                                                            
       3 |        0.38 | (MutationStage-1) | Running (ON CPU)       | 0                            | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
       3 |        0.38 | (Native-Transpor) | Disk (Uninterruptible) | __lock_page                  | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       2 |        0.25 | (MutationStage-1) | Disk (Uninterruptible) | __lock_page                  | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       2 |        0.25 | (MutationStage-2) | Running (ON CPU)       | futex_wait_queue_me          | system_call_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                                                                                                                                                                                                                                                                       
       2 |        0.25 | (MutationStage-7) | Disk (Uninterruptible) | call_rwsem_down_write_failed | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->file_update_time()->update_time()->xfs_vn_update_time()->xfs_ilock()->call_rwsem_down_write_failed()                                                                                                                             
       2 |        0.25 | (MutationStage-8) | Disk (Uninterruptible) | wait_on_page_bit_killable    | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       2 |        0.25 | (Native-Transpor) | Running (ON CPU)       | 0                            | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->__do_fault.isra.61()->xfs_filemap_fault()->__xfs_filemap_fault()->filemap_fault()->__lock_page_or_retry()->wait_on_page_bit_killable()                                                                                                                                                                 
       2 |        0.25 | (Native-Transpor) | Running (ON CPU)       | 0                            | page_fault()->do_page_fault()->__do_page_fault()->handle_mm_fault()->handle_pte_fault()->do_shared_fault.isra.64()->do_page_mkwrite()->xfs_filemap_page_mkwrite()->__xfs_filemap_fault()->iomap_page_mkwrite()->__lock_page()                                                                                                                                                                                              
       1 |        0.12 | (MessagingServic) | Running (ON CPU)       | 0                            | -                                                                                                                                                                                                                                                                                                                                                                                                                          

The comm2 column shows a thread from a thread pool, and the state column shows what that thread was doing at the time of sampling. What is clear from the output is that a number of threads are “ON CPU”, but what also catches the eye is that quite a portion of them are waiting for disk I/O to complete (the Disk (Uninterruptible) state). Here I’m showing just one sample and not going into the details of the values in the wchan and kstack columns. Sampling the threads several times shows the same behavior. For a database like Cassandra (if configured properly) we should not observe many threads spending their time on I/O.

This leads us to suspect that the storage subsystem cannot keep up. iostat to the rescue (for the following two outputs I only have screenshots saved):

It takes 53 msec to write 4.77 MB of data! Plus, avgqu-sz is low, so there is no queueing. The next contact is the storage team.

Let’s also look at the bcc-tools output, just for diversity. The last column is the latency in milliseconds, and the third column from the end is the amount of data (in bytes) written: 512 KB gets written in 0.4 seconds!
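That single observation already implies an abysmal effective write throughput; a quick calculation with the numbers from the output above:

```python
# Effective throughput of the write captured by bcc-tools:
# 512 KB completed in 0.4 seconds.
size_mb = 512 / 1024          # 0.5 MB
latency_s = 0.4
throughput_mb_s = size_mb / latency_s
print(f"{throughput_mb_s:.2f} MB/s")
```

A couple of MB/s is far below what any healthy storage subsystem should deliver for writes of this size.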

The whole investigation journey could have been much shorter had we simply run iostat and analyzed the I/O latency. But that would have been just a matter of luck. Even if the path is long, taking the right direction from the very beginning will definitely lead you to the cause.

This is it for now! Hope you enjoyed it!

Six Ways to Force “Sequential Scan” in PostgreSQL

Here are six ways to force a query to use the sequential scan access method in PostgreSQL. The objective is to eliminate any index access in the query. The test table looks as follows.

postgres=# create table tablescandemo as select generate_series(1,100000) id;
SELECT 100000
postgres=# create index inddemo on tablescandemo(id);

Let’s get started.

Option I. Using enable_indexscan, enable_indexonlyscan and enable_bitmapscan parameters.
Setting the enable_indexscan parameter to off disables the plain index scan access path. Let’s try it out.

postgres=# set enable_indexscan = off;
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                     QUERY PLAN                                                     
 Bitmap Heap Scan on tablescandemo  (cost=12.17..482.50 rows=500 width=4) (actual time=0.039..0.041 rows=1 loops=1)
   Recheck Cond: (id = 100)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on inddemo  (cost=0.00..12.04 rows=500 width=0) (actual time=0.026..0.027 rows=1 loops=1)
         Index Cond: (id = 100)
 Planning Time: 0.568 ms
 Execution Time: 0.091 ms
(7 rows)

The planner still accesses the index, but this time it behaves differently. The index is walked in the same way as in an index scan; however, instead of building a set of rowids of the matching heap rows, the database builds a bitmap: one bit per heap page. This is how a bitmap index scan works. The table is then scanned sequentially, and the bitmap is used to decide whether a particular page must be probed or not. The enable_bitmapscan parameter controls the bitmap index scan access method:

postgres=# set enable_bitmapscan = off;
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                         QUERY PLAN                                                         
 Index Only Scan using inddemo on tablescandemo  (cost=0.29..4.31 rows=1 width=4) (actual time=0.070..0.071 rows=1 loops=1)
   Index Cond: (id = 100)
   Heap Fetches: 0
 Planning Time: 0.086 ms
 Execution Time: 0.094 ms
(5 rows)

Now, an index only scan kicks in, since the table has a single column which is indexed. Let’s turn it off as well:

postgres=# set enable_indexonlyscan = off;
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                QUERY PLAN                                                
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=1 width=4) (actual time=0.021..9.114 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.065 ms
 Execution Time: 9.137 ms
(5 rows)

Basically, one needs to make sure that not just one but all index access methods (index scan, index only scan and bitmap index scan) are off.
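As an aside, the bitmap scan mechanics we saw in the first plan can be sketched in a few lines. This is a conceptual model only, not PostgreSQL’s actual implementation; the heap layout and all names below are made up for illustration:

```python
# Conceptual sketch of a bitmap index scan followed by a bitmap heap scan.
def bitmap_heap_scan(matching_pages, num_pages, read_page, predicate):
    # "Bitmap Index Scan": set one bit per heap page that may hold a match.
    bitmap = [False] * num_pages
    for page_no in matching_pages:
        bitmap[page_no] = True
    # "Bitmap Heap Scan": walk the heap in physical page order, probing only
    # the marked pages and rechecking the predicate on each tuple.
    rows = []
    for page_no in range(num_pages):
        if bitmap[page_no]:
            rows.extend(t for t in read_page(page_no) if predicate(t))
    return rows

# Hypothetical heap of 3 pages; the index says a match may live on page 1.
heap = {0: [1, 2], 1: [100, 3], 2: [5]}
print(bitmap_heap_scan([1], 3, lambda p: heap[p], lambda t: t == 100))
```

The recheck step mirrors the “Recheck Cond” line in the plan: with one bit per page, the page may contain non-matching tuples, so the predicate is re-evaluated on each tuple of a probed page.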

Option II. Using random_page_cost parameter.
This parameter tells the optimizer the cost of a randomly accessed disk page. It concerns both index and table pages. Therefore, the higher this value, the more expensive index access paths appear to the optimizer, especially if the index is not very selective. Let’s see the experiment results (the index scan, index only scan and bitmap index scan control parameters are all back on now):

postgres=# set random_page_cost = 100000;
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                QUERY PLAN                                                 
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=1 width=4) (actual time=0.034..10.175 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.071 ms
 Execution Time: 10.200 ms
(5 rows)

The optimizer now estimates that retrieving the tuple(s) via the index will cost more than a sequential scan and hence prefers the latter.

Option III. Altering table statistics.
At the end of the day, the optimizer uses database statistics, which one can actually fake. The selectivity of the predicate is the primary factor in deciding which access path to take. It is based on the cardinality, which is itself calculated as the number of tuples that satisfy the predicate. Assuming no extended statistics exist, the latter is nothing but N/NDV, where N is the number of tuples in the table and NDV is the number of distinct values of the column. The higher the NDV, the more selective the predicate is, and thus the higher the probability that the optimizer leans towards index access rather than a sequential scan. Conversely, if NDV is low, the optimizer will think that most of the table’s tuples qualify: index access would cost much more than a sequential scan, so the latter wins. Let’s check it out.

postgres=# alter table tablescandemo alter column id set (n_distinct = 1);
postgres=# analyze tablescandemo;
postgres=# explain analyze select * from tablescandemo where id = 100;
                                                  QUERY PLAN                                                   
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=100000 width=4) (actual time=0.015..6.366 rows=1 loops=1)
   Filter: (id = 100)
   Rows Removed by Filter: 99999
 Planning Time: 0.098 ms
 Execution Time: 6.379 ms
(5 rows)
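The rows=100000 estimate in the plan follows directly from the N/NDV arithmetic described above (a sketch of the estimate, not the planner’s actual code):

```python
# Planner-style cardinality estimate for an equality predicate "id = const":
# estimated rows = N / NDV (uniform distribution, no extended statistics).
def estimated_rows(n_tuples, n_distinct):
    return n_tuples / n_distinct

print(estimated_rows(100_000, 100_000))  # honest statistics: ~1 row, index wins
print(estimated_rows(100_000, 1))        # n_distinct faked to 1: "all rows match"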

Option IV. ‘Hacking’ table statistics.
Even though the previous option is a valid one, it has two drawbacks: 1) the alter table command takes a lock on the table and 2) analyze may take a considerable amount of time depending on the size of the table. If that is a concern, you can update statistics directly in the pg_statistic table. Surely, this is not an accurate approach, but it still works and may come in handy in extreme cases:

postgres=# update pg_statistic set stadistinct = 1 where starelid = (select oid from pg_class where relname = 'tablescandemo');
postgres=# explain analyze select * from tablescandemo where id = 6;
                                             QUERY PLAN                                              
 Seq Scan on tablescandemo  (cost=0.00..1693.00 rows=100000 width=4) (actual time=0.008..6.612 rows=1 loops=1)
   Filter: (id = 6)
   Rows Removed by Filter: 99999
 Planning Time: 0.115 ms
 Execution Time: 6.625 ms
(5 rows)


Option V. ‘Hacking’ table statistics (Part II).
Another statistic you can fake is the number of tuples a table contains (reltuples). The lower it is, the higher the probability that a sequential scan kicks in. This information is kept in the pg_class table and can be manipulated just like any other column value:

postgres=# update pg_class set reltuples=1 where relname = 'tablescandemo';
postgres=# explain analyze select * from tablescandemo where id = 6;
                                          QUERY PLAN                                           
 Seq Scan on tablescandemo  (cost=0.00..443.01 rows=1 width=4) (actual time=0.010..6.496 rows=1 loops=1)
   Filter: (id = 6)
   Rows Removed by Filter: 99999
 Planning Time: 0.138 ms
 Execution Time: 6.511 ms
(5 rows)
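Both seq scan cost figures above (1693.00 with real statistics and 443.01 after the fake) can be reproduced with the planner’s default cost constants (seq_page_cost = 1.0, cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025), assuming the table occupies 443 pages; a sketch of the formula:

```python
# Total cost of a filtered sequential scan:
# pages * seq_page_cost + tuples * (cpu_tuple_cost + cpu_operator_cost),
# where cpu_operator_cost pays for evaluating the filter on each tuple.
def seq_scan_cost(pages, tuples, seq_page_cost=1.0,
                  cpu_tuple_cost=0.01, cpu_operator_cost=0.0025):
    return pages * seq_page_cost + tuples * (cpu_tuple_cost + cpu_operator_cost)

print(seq_scan_cost(443, 100_000))  # with real statistics
print(seq_scan_cost(443, 1))        # after faking reltuples = 1
```

Note that the page component of the cost does not change: faking reltuples shrinks only the per-tuple CPU component, which is enough to make the sequential scan look nearly free to the optimizer.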

Option VI. Using hints.
PostgreSQL does not support hints out of the box. However, you can use the pg_hint_plan extension to hint queries. It provides hints for scan methods, join methods, join order, parallelism, row number correction and a few others.

Ohh…yes…and option VII: drop all indexes :)).

Parallel Query and %logons% Statistics

Here is a quick and interesting observation regarding the %logon% statistics. The database version is Oracle 19c and, apparently, it applies to earlier versions as well. The question I am trying to answer here is: “How does a parallel query execution relate to the %logon% statistics?”

Here is a test case.

Let’s check the statistics for the instance:

SQL> select name, value from v$sysstat where name like '%logon%';

NAME                                  VALUE
-------------------------------- ----------
logons cumulative                     68368
logons current                          108
user logons cumulative                 6762


Now, do nothing but log on to the database:

Connected to:
Oracle Database 19c Enterprise Edition Release - Production


Let’s check the statistics again:

SQL> select name, value from v$sysstat where name like '%logon%';

NAME                                  VALUE
-------------------------------- ----------
logons cumulative                     68369
logons current                          109
user logons cumulative                 6763


As a result, each of the statistics above gets incremented by one. So far, so good.

Now, from within that brand-new session, let’s execute a parallel query:

SQL> select /*+ parallel(8) */ count(*) from tt1;



Display the statistics to compare:

SQL> select name, value from v$sysstat where name like '%logon%';

NAME                                  VALUE
-------------------------------- ----------
logons cumulative                     68377
logons current                          109
user logons cumulative                 6763


As a result, the only statistic that changes is “logons cumulative”. It has grown by exactly the number of allocated parallel slaves: 8. Repeated execution of the same query with different degrees of parallelism increments only this statistic, each time by the degree used.
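In other words, one parallel execution adds exactly DOP slave logons; a trivial sanity check against the two v$sysstat snapshots above:

```python
# "logons cumulative" before and after the single parallel query
# (values taken from the two v$sysstat snapshots above).
before, after = 68_369, 68_377
degree = 8                      # the /*+ parallel(8) */ hint
slave_logons = after - before
print(slave_logons)
```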

The “logons current” statistic doesn’t change: it counts only the coordinator, not the slave sessions. The coordinator’s sid is actually the sid of the session that executes the query. Another observation is that the “user logons cumulative” statistic does not include slave processes either.

This may come in very handy when troubleshooting a logon storm caused by parallel execution.

Implicit Population of Materialized Views in In-Memory Column Store

The quickest and simplest way of checking what has been populated into the In-Memory Column Store (IMCS) is to query the V$IM_SEGMENTS view. Whenever an object is loaded into IMCS, the database adds a record about it to this view. In general, I guess the appearance of IMCS as such is an attempt by Oracle to transition to the NewSQL stack.

According to the Oracle documentation, IMCS can be enabled for a tablespace, a table (including external tables but excluding IOTs), a table partition and a materialized view (MV). What is more interesting is that it also mentions when, for example, a table gets populated into IMCS: only when it is fully scanned (and not in parallel). Access methods that retrieve table tuples through an index or by rowid never load the tuples into IMCS.

The documentation does not say a word about when an MV gets populated into IMCS. It would be reasonable to assume that since an MV is very much like a heap table, the table population rules apply to it as well, i.e. for an MV to get populated into IMCS, an explicit nonparallel full scan of it is necessary. This is indeed so. Additionally, implicit population of an MV is also possible, albeit with a little trick: a query against the base table that gets rewritten to fully access the MV can trigger it too. All that said, this post is not about something advanced, but rather an interesting take-away about IMCS.

The test case is on Oracle 12c. First, let’s create a non-empty table and define an MV on it:

SQL> create table testtable1 as select object_id, object_name from dba_objects;

Table created.

SQL> create materialized view testview1 enable query rewrite as select * from testtable1;

Materialized view created.


Let’s now enable IMCS for the view and confirm that it really gets populated there when scanned fully:

SQL> alter materialized view testview1 inmemory;

Materialized view altered.

SQL> select count(*) from testview1;


SQL> select segment_name, populate_status from v$im_segments;

SEGMENT_NAME                        POPULATE_STATUS
----------------------------------- -------------
TESTVIEW1                           COMPLETED

Now, let’s flush the MV from IMCS but make it eligible for population again. Also, let’s query the table so that the query rewrite kicks in:

SQL> alter materialized view testview1 no inmemory;

Materialized view altered.

SQL> alter materialized view testview1 inmemory;

SQL> select segment_name, populate_status from v$im_segments;

no rows selected.


As the result from v$im_segments shows, the MV is not there…yet. Let’s cause a full scan of the MV again:

SQL> select count(*) from testview1;


SQL> select segment_name, populate_status from v$im_segments;

no rows selected.


Subsequent executions of the same query still don't make the MV appear in the IMCS. The little trick, however, will get it populated:

SQL> alter table testtable1 inmemory;

Table altered.

SQL> select count(*) from testtable1;


SQL> select segment_name, populate_status from v$im_segments;

SEGMENT_NAME                   POPULATE_STATUS
------------------------------ -------------
TESTVIEW1		                COMPLETED


The query plan now is as follows:

Plan hash value: 3466806115
| Id  | Operation                                  | Name      | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
|   0 | SELECT STATEMENT                           |           |     1 |     3   (0)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE                            |           |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR                           |           |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)                     | :TQ10000  |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE                         |           |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR                     |           | 73306 |     3   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       MAT_VIEW REWRITE ACCESS INMEMORY FULL| TESTVIEW1 | 73306 |     3   (0)| 00:00:01 |  Q1,00 | PCWP |            |

The next time the query executes, the table data will be retrieved from the MV, which is now in the IMCS (the reason for going parallel is a bug described in note 2373556.1, which has a workaround). So the query rewrite can cause MV population in the IMCS, but for that to happen the base table must have the INMEMORY option enabled.

That’s all for this time. More advanced stuff is coming soon…

Parallel Query and V$SQL

The following is a (short and quick) scenario I recently faced when analyzing a parallel query and its response time. This may not apply to earlier versions of Oracle though. The experiments were carried out in 19c.

The 'weird' behavior I observed was that the V$SQL.EXECUTIONS column didn't get updated for the query under analysis when it was executed. The first gut reaction is, perhaps, "That must be a bug". Let's be more systematic, however, and arrive at the root cause gradually.

Here is a test case. First, I simply execute the query three times and check v$sql.

SQL> select sql_id, sql_text, executions, child_number from v$sql where sql_text like '%test1%';

SQL_ID	      SQL_TEXT                              EXECUTIONS CHILD_NUMBER
------------- ---------------------------------     ---------- ------------
4gy00kap2kf3n	select /*+ test1 */ * from table1   3	        0
4gy00kap2kf3n	select /*+ test1 */ * from table1   0	        1

2 rows selected.


The output shows that the database decided to generate another child (CHILD_NUMBER=1) for this statement. However, the primary question here is “Why are there NO executions for the second child?“.

The reason for not sharing, which can be found in v$sql_shared_cursor, is not of interest here, although it may shed some light on the question. Still, let's take a different strategy without consulting this view.

Two approaches can be taken to move forward with this 'issue': a) try A, try B, try C, … and hope that one of them eventually helps to find the reason, or b) ask the right questions. Before blaming the database for bugs, let's put that aside as the last resort. There must be something about this that is not known yet and that may lead to the clue.

The simplest question one can ask here is something like "Is there any other statistic that applies to child cursors which has probably been missed?" This is a separate child and, thus, the entry point should be v$sql. Scanning the column names of the view for '%EXEC%' reveals another field: PX_SERVERS_EXECUTIONS. Let's query this column as well:

SQL> select sql_id, sql_text, executions, px_servers_executions, child_number from v$sql where sql_text like '%test1%';

SQL_ID        SQL_TEXT                           EXECUTIONS PX_SERVERS_EXECUTIONS CHILD_NUMBER
------------- ---------------------------------- ---------- --------------------- ------------

4gy00kap2kf3n select /*+ test1 */ * from table1  3			 0	                   0
4gy00kap2kf3n select /*+ test1 */ * from table1  0			 12	                   1

2 rows selected.


PX_SERVERS_EXECUTIONS holds the total number of executions by the slaves of a parallel query. This explains why the second child's executions are not reflected in the V$SQL.EXECUTIONS column: for parallel slaves Oracle tracks them in a different column, V$SQL.PX_SERVERS_EXECUTIONS, and in this case the cursor has been executed by slaves 12 times so far. Since the query does not have any parallelism forced by a hint, table-level parallelism must be set, which is confirmed by the degree of parallelism being equal to 4. The cursor with CHILD_NUMBER=0 is actually the coordinator. The query executes in parallel:

SQL> select * from dbms_xplan.display_cursor('4gy00kap2kf3n', 0, 'ALLSTATS ALL');

SQL_ID	4gy00kap2kf3n, child number 0
select /*+ test1 */ * from table1

Plan hash value: 1574891232
| Id  | Operation            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
|   0 | SELECT STATEMENT     |          | 73216 |  9438K|   110   (0)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)| :TQ10000 | 73216 |  9438K|   110   (0)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX BLOCK ITERATOR |          | 73216 |  9438K|   110   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   4 |     TABLE ACCESS FULL| TABLE1   | 73216 |  9438K|   110   (0)| 00:00:01 |  Q1,00 | PCWP |            |
Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (E - Syntax error (1))
   1 -  SEL$1
   E -  test1
   - Degree of Parallelism is 4 because of table property

28 rows selected.


Moral: a step-by-step approach is the key regardless of the issue and the DBMS.

A potential method for determining the join method without explain plan

I will not pretend that the idea presented in this post addresses a large number of troubleshooting scenarios, but in some cases it may come in handy. For example, imagine a situation where you cannot log in to the database because a long-running query makes it unresponsive. However, the OS is running fine and can be worked with.

The simple and not-yet-so-mature approach presented here should help one judge the potential join method that a query is performing. At least, it may give some (though incomplete) ground to reason about it.

Given that all other conditions are the same, the Nested Loop join is the method that will execute the highest number of iterations to fetch the requested tuples. The other two methods, the Hash Join and the Sort-Merge Join, will involve fewer iterations and, perhaps, fewer loops.
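To make this intuition concrete, here is a minimal Python sketch (purely illustrative — not how any RDBMS actually implements joins) that counts the iterations each method performs on the same inputs:

```python
# Count inner-loop iterations for a nested loop join vs. a hash join
# on the same inputs. The iteration gap is what the branch-counting
# idea later in this post tries to observe from the OS side.

def nested_loop_join(left, right):
    """Join on equality; returns (matches, iterations)."""
    iterations = 0
    matches = []
    for l in left:
        for r in right:          # full scan of `right` for every row of `left`
            iterations += 1
            if l == r:
                matches.append((l, r))
    return matches, iterations

def hash_join(left, right):
    """Build a hash table on `right`, then probe it with `left`."""
    iterations = 0
    table = {}
    for r in right:              # build phase: one pass over `right`
        iterations += 1
        table.setdefault(r, []).append(r)
    matches = []
    for l in left:               # probe phase: one pass over `left`
        iterations += 1
        for r in table.get(l, []):
            matches.append((l, r))
    return matches, iterations

left = list(range(1000))
right = list(range(500, 1500))
nl_matches, nl_iters = nested_loop_join(left, right)
h_matches, h_iters = hash_join(left, right)
print(nl_iters, h_iters)   # nested loop: 1,000,000 iterations; hash: 2,000
```

Both joins return the same 500 matching pairs, but the nested loop executes three orders of magnitude more loop iterations — exactly the kind of gap that shows up in branch-instruction counters.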

The scenario with the assumptions above is not a rare one. However, the question can be: "Once you start suspecting that it is most probably the Nested Loop join that's causing the hang, how can that help?". At least, one can tell the software developers to fix the query by changing the join method, which should solve the issue. Since (as assumed) the database is unresponsive, one cannot peek at the explain plan and tell for sure. What can help, though, is "expressing the join method via loops".

To do this, let’s step aside a little.

When the compiler converts a loop in high-level code (while/for) to assembly, it maps it to a branch instruction. The branch instruction updates the contents of the Program Counter (PC; known as the Instruction Pointer in x86) register with the address stored as the Branch Target Address (BTA). On the next cycle the CPU fetches the instruction at the PC (which is now the address of the instruction pointed to by the branch instruction) and executes it. To illustrate this, here is an assembly example for ARM processors:

.global main

main:   mov     r0, #0     /* setting up initial variable a */
loop:   cmp     r0, #4     /* checking if a==4 */
        beq     end        /* proceeding to the end if a==4 */
        add     r0, r0, #1 /* increasing a by 1 if the jump to the end did not occur */
        b       loop       /* repeating the loop */
end:    bx      lr         /* THE END */

Here, b is a branch instruction with the label loop. The label loop points to the cmp (compare) instruction. When the CPU reaches the b loop instruction, the PC's contents are updated with the value of the loop label, which is nothing but the address of the cmp instruction. On the next cycle, when the PC is fetched, the address of the cmp is returned and the instruction at that address is fetched and executed. In this way, the branch instruction "makes a jump". This is how a loop iterates.

Thus, a high number of branch instructions can be an indicator of an excessive execution of iterations, in particular, a Nested Loop join.

The next step is finding an OS statistic that shows the number of branch instructions executed in real time. This is where the perf tool comes into play.

The perf tool is a performance profiler that can display various hardware and software statistics and counters. It can also be attached to a running process. For example, as follows:

# perf stat -p 4539
       1920.159821 task-clock                #    0.991 CPUs utilized          
                13 context-switches          #    0.007 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
               258 page-faults               #    0.134 K/sec                  
     5,649,595,479 cycles                    #    2.942 GHz                     [83.43%]
     1,808,339,931 stalled-cycles-frontend   #   32.01% frontend cycles idle    [83.54%]
     1,171,884,577 stalled-cycles-backend    #   20.74% backend  cycles idle    [66.77%]
     8,625,207,199 instructions              #    1.53  insns per cycle        
                                             #    0.21  stalled cycles per insn [83.51%]
     1,488,797,176 branches                  #  775.351 M/sec                   [82.58%]
        53,395,139 branch-misses             #    3.59% of all branches         [83.78%]

       1.936842598 seconds time elapsed

The command accumulates statistics until cancelled (CTRL+C). So, the perf output shows that by that time 1,488,797,176 branch instructions had been executed, out of which 53,395,139 the branch predictor failed to predict. The former is actually the main metric of interest here. If large deltas are observed in the branches metric, the code is executing a high number of iterations. This, in turn, can direct us to suspect that (probably) a Nested Loop join is running intensively. Peeking at this metric sampled over time and seeing that the number is unreasonably high can be a tip that the code needs to be checked for a loop that is iterating over and over again.
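The delta logic above can be sketched in a few lines of Python. This is a hypothetical helper, not part of perf: it assumes you have already collected a series of cumulative branch counts (e.g., by scraping repeated `perf stat` runs), and the threshold is something you would calibrate against a known-good baseline.

```python
# Given cumulative branch-instruction counts sampled over time, compute
# per-interval deltas and flag intervals with an abnormally high branch rate.

def branch_deltas(samples):
    """samples: list of cumulative branch counts; returns per-interval deltas."""
    return [b - a for a, b in zip(samples, samples[1:])]

def suspicious_intervals(samples, threshold):
    """Indexes of intervals whose delta exceeds `threshold` branches."""
    return [i for i, d in enumerate(branch_deltas(samples)) if d > threshold]

# Cumulative counts sampled every second (made-up numbers for illustration):
samples = [1_488_797_176, 1_490_100_000, 2_700_000_000, 2_701_300_000]
print(branch_deltas(samples))
print(suspicious_intervals(samples, threshold=100_000_000))  # interval 1 stands out
```

A sudden jump of over a billion branches in one interval, against a baseline of a couple of million, is the pattern that would make me go looking for a tight loop.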

Side note: the Nested Loop join is not the only code path that can cause this. It can also be, for example, latch contention, since retry logic is an essential part of pretty much any latch implementation. Another example is an index scan, when the code iteratively walks through the index to find a key — which (by the way) can itself be part of a Nested Loop join. The approach presented should be applicable to any RDBMS.

In the next post, I will reproduce a highly iterative Nested Loop join and will analyze the metric mentioned. This will give a better insight into improving the idea briefly presented.


When INDEX RANGE SCAN is actually INDEX UNIQUE SCAN

The Oracle optimizer can sometimes trick you, and this post is about a finding that clearly demonstrates it. Putting it shortly, while the execution plan shows that the optimizer is doing an "INDEX RANGE SCAN", in actuality it is going with an "INDEX UNIQUE SCAN". Here is a test case in Oracle 19c.

First, let's create a table, populate it with data and add an index on its column.

SQL> create table table1(id integer);

Table created.

SQL> insert into table1 select object_id from dba_objects;

7433 rows created.

SQL> create index IND on table1(id);

Index created.

Let’s confirm the type of the index just created which will be needed a little later

SQL> select UNIQUENESS from dba_indexes where index_name = 'IND';

UNIQUENES
---------
NONUNIQUE
Since the index is nonunique, the optimizer will (or should !?) never go with “INDEX UNIQUE SCAN”.

Let’s now add a primary key.

SQL> alter table table1 add primary key(id);

Table altered.

Since an index on the column ID already exists, Oracle doesn't attempt to create a new one; it links the constraint to the existing index. We can confirm this by querying the DBA_CONSTRAINTS view. Additionally, the database does not update the index type in the data dictionary.

SQL> select constraint_name, index_name from dba_constraints where table_name = 'TABLE1';

CONSTRAINT_NAME                INDEX_NAME
------------------------------ ------------------------------
SYS_C008559                    IND

SQL> select UNIQUENESS from dba_indexes where index_name = 'IND';

UNIQUENES
---------
NONUNIQUE
What we have so far is that the table has a primary key constraint enforced via an index which Oracle reports as NONUNIQUE (which effectively is not so).

It is interesting to see what index access path will be taken by the optimizer for a query with equality predicate.

select * from table1 where id = 2;
Plan Table
| Id  | Operation         | Name    | Rows  | Bytes | Cost  | Time      |
| 0   | SELECT STATEMENT  |         |       |       |     1 |           |
| 1   |  INDEX RANGE SCAN | IND     |     1 |    13 |     0 |           |

The optimizer reports an "INDEX RANGE SCAN". According to the type of the index in the data dictionary this is expected. However, we definitely know that the index is unique, since that is what the primary key constraint is enforced by, and a unique index should not be causing range scans.

Let’s see what session statistics related to indexes the database reports.

select s.name, value from v$mystat m, v$statname s where s.STATISTIC# = m.STATISTIC#
and s.name like '%index%'
and value != 0;

NAME                        VALUE
--------------------------- ----------
index range scans           1

So, the statistics also state that the optimizer chooses the "INDEX RANGE SCAN" access method (this statistic is missing in the Database Reference 19c).

Finally, let’s take a look at the optimizer trace (Access path analysis section only) to see what really happens behind the scenes.

 Access Path: TableScan
    Cost:  2.000631  Resp: 2.000631  Degree: 0
      Cost_io: 2.000000  Cost_cpu: 23521
      Resp_io: 2.000000  Resp_cpu: 23521
  Access Path: index (index (FFS))
    Index: IND
    resc_io: 2.000000  resc_cpu: 0
    ix_sel: 0.000000  ix_sel_with_filters: 1.000000
  Access Path: index (FFS)
    Cost:  2.000000  Resp: 2.000000  Degree: 1
      Cost_io: 2.000000  Cost_cpu: 0
      Resp_io: 2.000000  Resp_cpu: 0
 ****** Costing Index IND
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Access Path: index (UniqueScan)
    Index: IND
    resc_io: 0.000000  resc_cpu: 1050
    ix_sel: 0.004000  ix_sel_with_filters: 0.004000
    Cost: 0.000028  Resp: 0.000028  Degree: 1
 ****** Costing Index IND
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Access Path: index (AllEqUnique)
    Index: IND
    resc_io: 0.000000  resc_cpu: 1050
    ix_sel: 0.004000  ix_sel_with_filters: 0.004000
    Cost: 0.000028  Resp: 0.000028  Degree: 1
 One row Card: 1.000000
  Best:: AccessPath: IndexUnique
  Index: IND
         Cost: 0.000028  Degree: 1  Resp: 0.000028  Card: 1.000000  Bytes: 0.000000

Voila! In actuality, no "INDEX RANGE SCAN" happens but an "INDEX UNIQUE SCAN" instead. This is confirmed by the access path choice the optimizer ends up with: Best:: AccessPath: IndexUnique (line 30). The "INDEX RANGE SCAN" access path is not even considered as an option. Apparently, the session statistics get their data from what the optimizer shows in the execution plan.

In summary, while the optimizer reports the "INDEX RANGE SCAN" access method, it may not actually be doing so, but an "INDEX UNIQUE SCAN" instead. The execution plan is not the source of truth in such a case.

The “force merge” process marks indexes searchable in Elasticsearch

It is well known that once Elasticsearch (ES) indexes a document, the document does not become searchable right away. For it to become so, a refresh must run, which internally copies the documents to a new segment in the OS cache and opens it for searches. Once the index has been refreshed, all documents within it become searchable.

According to the ES documentation, the refresh process runs every second by default. Alternatively, one can run it manually. It can also be disabled entirely by setting the index's "refresh_interval" setting to -1.

A simple experiment reveals that there is another (undocumented) case which can make an index searchable: the force merge.

This process runs in the background and periodically merges segments into larger segments. It is necessary because of updated and/or deleted documents, since ES does not implement these operations in place. For example, when a document is updated, ES creates a new document with a higher version number and keeps the old document until the next merge; when the merge runs, out of the several versions of the same document it keeps only the one with the highest version number.
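The version-resolution idea can be sketched as follows. This is an illustrative Python model of the behavior described above, not Elasticsearch/Lucene code; segment and document shapes here are invented for the example.

```python
# Model of merge-time version resolution: when segments are merged, out of
# several copies of the same document only the highest-versioned one survives.

def merge_segments(segments):
    """segments: lists of (doc_id, version, source) tuples; returns merged docs."""
    survivors = {}
    for segment in segments:
        for doc_id, version, source in segment:
            current = survivors.get(doc_id)
            if current is None or version > current[0]:
                survivors[doc_id] = (version, source)
    return [(doc_id, v, src) for doc_id, (v, src) in sorted(survivors.items())]

old_segment = [("1", 1, {"name": "kimchy1"}), ("2", 1, {"name": "kimchy2"})]
new_segment = [("1", 2, {"name": "kimchy1-updated"})]  # doc 1 was updated
print(merge_segments([old_segment, new_segment]))
```

After the merge, document 1 exists only in its version-2 form, while document 2 survives untouched — the stale version-1 copy of document 1 is dropped.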

Let’s put up an experiment and check out how the merge process affects searchability of the index.

First, let’s add an index:

curl -X PUT "localhost:9200/index_custom?pretty"

Let’s then make sure that the refresh is disabled by setting the value of “refresh_interval” to -1:

curl -X PUT "localhost:9200/index_custom/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval": "-1"
  }
}'
Now, let’s add a couple of documents to the index (with ids 1 and 2 respectively):

curl -X PUT "localhost:9200/index_custom/_doc/1?timeout=5m&pretty" -H 'Content-Type: application/json' -d'
{
    "id": "kimchy",
    "name": "kimchy1"
}'

curl -X PUT "localhost:9200/index_custom/_doc/2?timeout=5m&pretty" -H 'Content-Type: application/json' -d'
{
    "id": "kimchy",
    "name": "kimchy2"
}'

Let’s now flush it to disk to ensure that segments have been created:

curl -X POST "localhost:9200/index_custom/_flush?pretty"

Confirm the number of documents in the index (docs.count):

curl 'localhost:9200/_cat/indices?v'
health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index_custom       jqTFVQm0R_u9bL2kIGQrUA   1   1          2            0      4.8kb           225b

Let’s now search a document

curl -X GET "localhost:9200/index_custom/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
        "term" : { "id" : "kimchy" }
  }
}'

and check the response

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

As expected, no document is returned since the index has not been refreshed yet.

Let’s confirm it by checking the statuses of the segments of the index. Notice that the value of the “searchable” column is false for both of the segments.

curl -X GET "localhost:9200/_cat/segments?v=true&pretty"
index              shard prirep ip          segment generation docs.count docs.deleted     size size.memory committed searchable version compound

index_custom       0     r      localhost_0               0          2            0    4.3kb           0 true      false      9.0.0   true
index_custom       0     p      localhost_0               0          2            0    4.3kb           0 true      false      9.0.0   true

Finally, let’s manually invoke the merge process

curl -X POST "localhost:9200/index_custom/_forcemerge?pretty"

and rerun the query. As a result, the documents are returned, which means that the index has become searchable:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "index_custom",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "id" : "kimchy",
          "name" : "kimchy1"
        }
      },
      {
        "_index" : "index_custom",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "id" : "kimchy",
          "name" : "kimchy2"
        }
      }
    ]
  }
}

This can also be confirmed by double-checking the value of the "searchable" column in the list of segments.

Conclusion: The refresh process is not the only one that makes index searchable. The index can be made so by force merging as well. This is not documented.

Deciphering PostgreSQL process swap space utilization

One of the questions a colleague of mine recently asked was: "How can I find out how much swap space a PostgreSQL process is really using?". This triggered my curiosity, which I will talk about in this post.

At first, it might seem that we can use the Linux top command, which has memory-related columns. Let's run it for an arbitrary process and look at the output:

[root@localhost ~]# top -p 3748

top - 00:42:16 up 3 days,  7:43,  5 users,  load average: 0.00, 0.01, 0.05
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.1 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16266588 total, 15392968 free,   233036 used,   640584 buff/cache
KiB Swap:  1679356 total,  1615172 free,    64184 used. 15566964 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND               SWAP
3748 postgres  20   0 8851352   4520   3344 S   0.0  0.0   0:00.01 postmaster             972

The top command reports that the process (pid 3748, last column) is utilizing 972 KB of swap space. There are three questions here:
1) Is this really swap, or is the OS faking it?
2) If it is, then how accurate is this number?
3) Why does the OS allocate swap to this process while there is free RAM?

Here, I will try to answer the first two questions (third one is for another post).

According to the Linux top manual, the reported SWAP column can include any of the quadrants from #1 through #3, each of which includes memory pages allocated via certain functions. For example, quadrant #1 includes the process stack, and that does not actually tell us anything about being swapped. It only tells us that the memory pages allocated for the process stack are eligible for being swapped in/out.

So, how can we find out how much of the process address space is mapped to, and really resides in, the swap file?

To answer this question, let’s get back to some fundamentals.

A Linux process is provided with a virtual address range. A page backing an address from this range can reside either in RAM or on secondary storage (HDD, SSD, etc.). The process page table maintains the mapping from virtual addresses to physical addresses. Moreover, it also holds control bits, each of which tracks some information about the page, e.g., the dirty bit indicates whether the page's content is in sync with the disk, the unevictable bit indicates that the page cannot be evicted from the LRU list, and so on. This information can be queried from the /proc/kpageflags file, which in the latest kernels holds 26 flags.

Two of the flags are related to swapping out of which swapcache is the one we need. This bit indicates whether the page table entry (PTE) has an associated swap entry.

Let’s now see how we can calculate the total swap space occupied by a process.

The environment for the experiments is PostgreSQL 14 running on CentOS 7.7. Let's first connect to the database and retrieve its OS PID to have a process to work with:

-bash-4.2$ psql postgres
psql (14.2)
Type "help" for help.

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          16460
(1 row)

The next step is to get the swapcache bit value for every page in every PTE of this process. This information is available in /proc/$PID/pagemap, from which we can find out which physical frame a virtual page is mapped to (the process page table) and the flag values (this file must be parsed to be human-readable; I used the parser available here). For the sake of demonstration, I will show just a single range of virtual addresses, which can be obtained from /proc/$PID/maps.

The file holds all virtual address ranges of a process, including the libraries associated with each range, permissions, etc.:

[root@localhost ~]# tail /proc/16460/maps
7f5a72e70000-7f5a72e77000 rw-s 00000000 00:13 246403                     /dev/shm/PostgreSQL.1005331588
7f5a72e77000-7f5a72e91000 rw-p 00000000 00:00 0
7f5a72e96000-7f5a72e97000 rw-s 00000000 00:04 327680                     /SYSV01008d37 (deleted)
7f5a72e97000-7f5a72e98000 rw-p 00000000 00:00 0
7f5a72e98000-7f5a72e99000 r--p 00021000 fd:00 35678                      /usr/lib64/;623f47b9 (deleted)
7f5a72e99000-7f5a72e9a000 rw-p 00022000 fd:00 35678                      /usr/lib64/;623f47b9 (deleted)
7f5a72e9a000-7f5a72e9b000 rw-p 00000000 00:00 0
7fff241ac000-7fff241cd000 rw-p 00000000 00:00 0                          [stack]
7fff241de000-7fff241e0000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Let's take one of the ranges and pass it to the pagemap executable:

[root@localhost pagemap-master]# ./pagemap 16460 0x7f585720f000 0x7f585721b000
0x7f585720f000     : pfn 3e2942           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857210000     : pfn 3e2943           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857211000     : pfn 41c95c           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857212000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857213000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857214000     : pfn 41c95f           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857215000     : pfn 3ff0e8           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857216000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857217000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f5857218000     : pfn 3ff0eb           soft-dirty 1 file/shared 1 swapped 0 present 1
0x7f5857219000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0
0x7f585721a000     : pfn 0                soft-dirty 1 file/shared 0 swapped 0 present 0

As an example, consider the first line of the output. The first value is the virtual address 0x7f585720f000, followed by the pfn 3e2942 (the page frame number, i.e., the number of the physical frame, in hexadecimal), followed by attribute names and their values. For this address we see that the swapped bit is 0. This is what we were looking for. Thus, for any page in the process page table we can get the bit value which tells us whether the page is associated with a swap entry.

However, just the fact that a page is mapped to a swap entry does not yet mean that it has been swapped out. We also have to make sure that the associated page in the swap space is in use. This can be determined by looking at the locked bit in /proc/kpageflags, which is set when a page has been exclusively locked by the process (e.g., when doing I/O). As seen from the last output, the pagemap parser does not report this bit, but it can be added. Moreover, the parser also has another executable (pagemap2) which prints all pages of a process.

To sum up, the actual swap space used by a process can be calculated as the number of pages with both the swapped and locked bits set, times the page size. I will not pretend that I have written a script that calculates it for a running process, but making one seems to be a good idea that can help troubleshooting in certain memory-related scenarios.
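The core of such a script can be sketched in Python. The pagemap entry layout used here (bit 63 = present, bit 62 = swapped, low bits = PFN or swap type/offset) is documented in the kernel's Documentation/admin-guide/mm/pagemap.rst; the entries below are synthetic, and the cross-check against the locked bit in /proc/kpageflags is left out for brevity.

```python
# Decode 64-bit /proc/<pid>/pagemap entries and total up the swap space they
# map. A real script would read the pagemap file with struct.unpack("<Q", ...).

PAGE_SIZE = 4096  # assumed; query with `getconf PAGESIZE` on the target host

def is_swapped(entry):
    """True if a 64-bit pagemap entry has the swapped bit (bit 62) set."""
    return bool(entry & (1 << 62))

def swap_bytes(entries, page_size=PAGE_SIZE):
    """Swap space mapped by the given pagemap entries, in bytes."""
    return sum(page_size for e in entries if is_swapped(e))

# Synthetic entries: one present page, one swapped page, one unmapped page.
present = (1 << 63) | 0x3E2942         # present bit (63) + a PFN
swapped = (1 << 62) | 0x1A0            # swapped bit (62) + a swap type/offset
unmapped = 0
print(swap_bytes([present, swapped, unmapped]))  # -> 4096
```

Only the entry with the swapped bit contributes, so the three pages above account for a single 4 KB page of swap.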

Note: Although the motivation originated from a PostgreSQL database related problem, the approach should be valid for an arbitrary user process.

“Latch: row cache objects” wait event is gone in Oracle 21c

Oracle 21c seems to be full of little surprises. The "latch: row cache objects" wait event, which first appeared in the initial release of Oracle 10g and is one of the most frequently dealt with, has disappeared in Oracle 21c. Recall that this latch protects row cache (aka data dictionary cache) entries from simultaneous access by several processes and is acquired before a session can modify row cache objects. Oracle 21c no longer has it:

SQL> select name from v$event_name where name like '%latch: row%';

no rows selected


An obvious question is: if this latch is gone, then what replaced it?

To answer this question (or at least to make reasonable assumptions that will take us closer to it) let’s examine other wait events that serve similar purpose.

Firstly, recall that the row cache is made of several caches.

select cache#, parameter from v$rowcache;
3	dc_rollback_segments
1	dc_free_extents
4	dc_used_extents
2	dc_segments
0	dc_tablespaces
5	dc_tablespace_quotas
6	dc_files
10	dc_users
8	dc_objects
17	dc_global_oids
12	dc_constraints
13	dc_sequences
16	dc_histogram_defs
54	dc_sql_prs_errors
32	kqlsubheap_object

For example, the dc_sequences cache stores sequences and their associated metadata. Each cache is protected by a child latch.

Secondly, it would not make much sense to replace one latch with another. On the other hand, the cache still has to be protected by something. Therefore, it has to be something more lightweight than a latch: a mutex (a pin, by contrast, is less lightweight).

Let’s see what mutex related wait events are there in Oracle 21c.

select name from v$event_name where name like '%mutex%'
SecureFile fso mutex
SecureFile htbm mutex
SecureFile lhtb mutex
row cache mutex
cursor: mutex X
cursor: mutex S
library cache: mutex X
library cache: bucket mutex X
library cache: dependency mutex X
library cache: mutex S

The "row cache mutex" wait event seems very interesting in this context, doesn't it? Apparently, this is the one that protects the row cache now.

But one question still remains: why does Oracle 19c (and earlier versions, which I haven't checked) have both the "latch: row cache objects" and the "row cache mutex" wait events? An undocumented parameter could dictate the protection mechanism, but that does not seem to be the case, as far as I have researched. Most probably, in that version the row cache is protected by both — partly by latches, partly by mutexes — and the former are completely gone in 21c.