“Latch: row cache objects” wait event is gone in Oracle 21c

Oracle 21c seems to be full of little surprises. The “latch: row cache objects” wait event, which first appeared in the initial release of Oracle 10g and is one of the most frequently dealt with wait events, has disappeared in Oracle 21c. Recall that this latch protects row cache (aka data dictionary cache) entries from simultaneous access by several processes and is acquired before a session can modify row cache objects. Oracle 21c no longer has it:

SQL> select name from v$event_name where name like '%latch: row%';

no rows selected

SQL>

An obvious question is: if this latch is gone, then what replaced it?

To answer this question (or at least to make reasonable assumptions that take us closer to an answer), let’s examine other wait events that serve a similar purpose.

Firstly, recall that the row cache is made of several caches.

select cache#, parameter from v$rowcache;
--------------------------------------------
3	dc_rollback_segments
1	dc_free_extents
4	dc_used_extents
2	dc_segments
0	dc_tablespaces
5	dc_tablespace_quotas
6	dc_files
10	dc_users
8	dc_objects
17	dc_global_oids
12	dc_constraints
13	dc_sequences
16	dc_histogram_defs
54	dc_sql_prs_errors
32	kqlsubheap_object
..............................

For example, the dc_sequences cache (cache# 13) stores sequences and their associated metadata. Each cache is protected by its own child latch.

Secondly, it would not make much sense to replace one latch with another. On the other hand, the cache still has to be protected by something, so the replacement should be more lightweight than a latch. That is a mutex (a pin is the less lightweight alternative).
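As a conceptual illustration only (Python here; this is an analogy, not Oracle’s implementation), the contention difference between one coarse lock serializing a whole cache and a lock per cached object can be sketched like this:

```python
import threading

class CoarseCache:
    """One "latch" for the whole cache: every writer serializes here."""
    def __init__(self):
        self.lock = threading.Lock()
        self.data = {}

    def put(self, key, value):
        with self.lock:          # all sessions contend on this single lock
            self.data[key] = value

class FineGrainedCache:
    """One "mutex" per cached object: contention only on the same object."""
    def __init__(self):
        self.locks = {}
        self.meta = threading.Lock()
        self.data = {}

    def put(self, key, value):
        with self.meta:          # short critical section to look up the lock
            lock = self.locks.setdefault(key, threading.Lock())
        with lock:               # sessions touching other objects proceed freely
            self.data[key] = value
```

Sessions modifying different dictionary entries never block each other in the fine-grained scheme, which is the usual argument for moving from latches to mutexes.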

Let’s see what mutex related wait events are there in Oracle 21c.

select name from v$event_name where name like '%mutex%'
--------------------------------------------------------
SecureFile fso mutex
SecureFile htbm mutex
SecureFile lhtb mutex
row cache mutex
cursor: mutex X
cursor: mutex S
library cache: mutex X
library cache: bucket mutex X
library cache: dependency mutex X
library cache: mutex S

The row cache mutex wait event seems very interesting in this context, doesn’t it? Apparently, this is the one that protects the row cache now.

But one question still remains: why does Oracle 19c (and possibly earlier versions, which I haven’t checked) have both the “latch: row cache objects” and “row cache mutex” wait events? An undocumented parameter could dictate the protection mechanism, but that does not seem to be the case, as far as I have researched. Most probably, in those versions the row cache is protected by both mechanisms, partly by latches and partly by mutexes, and the former is completely gone in 21c.

Result Cache Malfunction in Oracle 21c

One of these days I was configuring the result cache in an Oracle 21c database and encountered some unusual behavior. Recall that the result cache, a slice of memory in the shared pool, was introduced in Oracle 11g. It stores query results so that subsequent executions can reuse them.

It looks like it does not work as documented in Oracle 21c. Two modes are available to configure it: MANUAL, where the query is hinted, and FORCE, which makes all queries store their results in the result cache (not the recommended approach, by the way). I was experimenting on Ubuntu 8. Here is the scenario (note that the result_cache_max_size parameter, which enables or disables the result cache, is set to 1792K).

First, let’s create a small table and fill it with some data.

create table table1(id integer, name varchar2(30));

insert into table1 values (1, 'Name1');
insert into table1 values (2, 'Name2');
insert into table1 values (3, 'Name3');
commit;

The query is as simple as the following one:

select /*+ RESULT_CACHE */ * from table1;

Plan hash value: 963482612

---------------------------------------------
| Id  | Operation         | Name   | E-Rows |
---------------------------------------------
|   0 | SELECT STATEMENT  |        |        |
|   1 |  TABLE ACCESS FULL| TABLE1 |      3 |
---------------------------------------------

Just to make sure that the query would use the previously cached results, I executed it multiple times. Still, no effect: the plan remains the same.
Tracing the optimizer reveals some details (though not yet clear why so):

Dumping Hints
=============
  atom_hint=(@=0x7f25d42835a8 err=38 resol=1 used=1 token=1192 org=1 lvl=2 txt=RESULT_CACHE ())
********** WARNING: SOME HINTS HAVE ERRORS *********
====================== END SQL Statement Dump ======================

That is, the optimizer considers RESULT_CACHE as a hint and errors out, explicitly reporting that it cannot apply it: err=38. For comparison I tried the FULL hint, and for that one err=0. It is easy to see that this is not a configuration-related problem (for example, a minimum data size required for caching), since the optimizer says it faces an error. Moreover, just to rule out any size-related limitations, I also tried with a large data set: no effect.

To be continued…

FILLFACTOR parameter in PostgreSQL (Part II – source code hacking)

In Part I of this series I investigated the effect of the fillfactor parameter in Postgres 14.1. It turned out that the database does not always store the updated version of a tuple on a new page, whereas it should, at least with the default value of fillfactor=100.

Now, I will investigate the source code to see what prevents the database from doing so.

This time I used a table with fewer columns to rule out The Oversized-Attribute Storage Technique (TOAST). This is how PostgreSQL manages to store large field values, since it does not support spanning a tuple across multiple pages.

create table fillfactortable(
      col1 varchar(30),col2 varchar(30),col3 varchar(30),col4 varchar(30),col5 varchar(30),col6 varchar(30)
      );

insert into fillfactortable (
    col1, col2, col3,col4,col5,col6
)
select	
    left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10)
from generate_series(1, 1) s(i);

The size of the table is 8KB and fillfactor has its default value of 100.

select * from pg_total_relation_size('fillfactortable');
pg_total_relation_size
----------------
 8192 bytes
(1 row)

To find out how a tuple is updated, let’s take a look at the source code. The routine that does it starts at line 3155 in src/backend/access/heap/heapam.c:

TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
			CommandId cid, Snapshot crosscheck, bool wait,
			TM_FailureData *tmfd, LockTupleMode *lockmode)
...............................................................................

Here, the following function is called to do the actual update (line 3926).

RelationPutHeapTuple(relation, newbuf, heaptup, false);

The newbuf argument’s value decides whether the updated version of the tuple will be on the same page or not. This decision is based on the following condition (line 3670):

pagefree = PageGetHeapFreeSpace(page);
newtupsize = MAXALIGN(newtup->t_len);
if (need_toast || newtupsize > pagefree)
{
..........................................
}
else
	{
		/* No TOAST work needed, and it'll fit on same page */
		newbuf = buffer;
		heaptup = newtup;
	}

In my scenario need_toast is false, since the table has just six attributes of varchar(30). Moreover, had there been another page for TOAST, the pg_total_relation_size function would have returned 16KB, since an additional page would have been allocated; this can also be confirmed by querying pg_class with the corresponding value for the oid column. That is, newtupsize > pagefree evaluating to false is the only case in which the new version of the tuple is placed on the same page, since only then does newbuf = buffer. This is what actually happens here, because pagefree receives a value greater than newtupsize.
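The same-page decision above can be sketched in Python (an illustration of the C logic; MAXALIGN is assumed here to round up to an 8-byte boundary, the typical MAXIMUM_ALIGNOF):

```python
MAXIMUM_ALIGNOF = 8  # typical platform alignment; assumption for this sketch

def maxalign(length):
    # round length up to the next multiple of MAXIMUM_ALIGNOF
    return (length + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)

def stays_on_same_page(need_toast, newtup_len, pagefree):
    # mirrors: if (need_toast || newtupsize > pagefree) -> new page,
    # else the new tuple version stays on the current page
    newtupsize = maxalign(newtup_len)
    return not need_toast and newtupsize <= pagefree
```

For example, a 61-byte tuple aligns up to 64 bytes and easily fits into 8064 free bytes, so the update stays on the same page.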

To investigate further, let’s see how the free space is computed. PageGetHeapFreeSpace relies on PageGetFreeSpace, which can be found at line 899 in src/backend/storage/page/bufpage.c:

Size
PageGetFreeSpace(Page page)
{
	int			space;

	/*
	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
	 * pd_upper.
	 */
	space = (int) ((PageHeader) page)->pd_upper -
		(int) ((PageHeader) page)->pd_lower;

	if (space < (int) sizeof(ItemIdData))
		return 0;
	space -= sizeof(ItemIdData);

	return (Size) space;
}

Here, pd_lower and pd_upper are the offsets of the start and end of free space in a page, respectively. The subtraction pd_upper - pd_lower calculates the space available between the end of the line pointer array and the start of the tuple data. The condition if (space < (int) sizeof(ItemIdData)) then checks how much space would ultimately be left after deducting the space needed for one more line pointer.

Let’s see what the values of pd_lower and pd_upper are for the table (heads up: no fillfactor involvement so far!):

SELECT lower, upper FROM page_header(get_raw_page('fillfactortable', 0));
lower       upper
--------   ----------
 28        8096
(1 row)

That is, the page has 8096 - 28 = 8068 bytes free. From this, the 4 bytes taken by a single line pointer have to be subtracted as well, which gives 8064 for the pagefree variable, far greater than the tuple size of a few tens of bytes.
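The same arithmetic can be reproduced with a minimal Python port of PageGetFreeSpace (values taken from the page_header output above):

```python
ITEM_ID_DATA_SIZE = 4  # sizeof(ItemIdData): one line pointer

def page_get_free_space(pd_lower, pd_upper):
    # mirrors PageGetFreeSpace: the hole between the line pointer array
    # and the tuple data, minus room for one more line pointer
    space = pd_upper - pd_lower
    if space < ITEM_ID_DATA_SIZE:
        return 0
    return space - ITEM_ID_DATA_SIZE

pagefree = page_get_free_space(28, 8096)
print(pagefree)  # 8064
```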

Summary:
A table with fillfactor=100 does not always place the new version of an updated tuple on a new page, which should not be the case according to the documentation. The tuple update routine uses the PageGetFreeSpace function to decide whether the new version of the updated tuple resides on the same page. Neither the caller nor the routine itself makes use of the fillfactor value. The routine calculates free space based on pd_upper, pd_lower and the line pointers. In this scenario, pagefree would have had to be zero for the database to allocate an additional page and place the new version of the tuple there. As demonstrated, this did not happen even though fillfactor was set to 100. In Part III of this post series I will investigate cases when it does occur.

FILLFACTOR parameter in PostgreSQL (Part I)

The purpose of FILLFACTOR in PostgreSQL is widely known. According to the documentation, the default value of 100 means complete packing. That is, the new version of a tuple will be placed on a different page. Here is a simple experiment that demonstrates that this is not always the case.
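For reference, what a given fillfactor asks the database to keep free on each page can be sketched as follows (a simplified Python rendering of the BLCKSZ * (100 - fillfactor) / 100 calculation; BLCKSZ = 8192 is the default page size):

```python
BLCKSZ = 8192  # default PostgreSQL page size

def target_page_free_space(fillfactor):
    # bytes a fillfactor below 100 asks PostgreSQL to leave free on each
    # page for future updates; 100 reserves nothing (complete packing)
    return BLCKSZ * (100 - fillfactor) // 100

print(target_page_free_space(100))  # 0
print(target_page_free_space(90))   # 819
```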

Here is the experimental table:

create table fillfactortable(
      col1 varchar(30),col2 varchar(30),col3 varchar(30),col4 varchar(30),col5 varchar(30),col6 varchar(30),
      col7 varchar(30),col8 varchar(30),col9 varchar(30),col10 varchar(30),col11 varchar(30),col12 varchar(30),
      col13 varchar(30),col14 varchar(30),col15 varchar(30),col16 varchar(30),col17 varchar(30),col18 varchar(30),
      col19 varchar(30),col20 varchar(30)
      ) ;

As seen, the FILLFACTOR is not explicitly specified, i.e. the default value is used. Let’s insert a single row

insert into fillfactortable (
    col1, col2, col3,col4,col5,col6, col7,col8,col9,col10,col11,col12,
    col13,col14,col15,col16,col17,col18, col19,col20
)
select	
    left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10)
from generate_series(1, 1) s(i);

followed by checking the number of pages in the table.

select pg_size_pretty (pg_relation_size('fillfactortable'));
 pg_size_pretty
----------------
 8192 bytes
(1 row)

The table has only one page. Now, let’s update the tuple.

update fillfactortable set col1 = 'random';
UPDATE 1

This results in a dead tuple which can be confirmed via the following query:

select n_live_tup, n_dead_tup, relname from pg_stat_all_tables where relname = 'fillfactortable';
 n_live_tup | n_dead_tup |     relname
------------+------------+-----------------
          1 |          1 | fillfactortable
(1 row)

Now we can check the table size again:

select pg_size_pretty (pg_relation_size('fillfactortable'));
 pg_size_pretty
----------------
 8192 bytes
(1 row)

That is, the table still has only one page, i.e. no page has been allocated to store the other version of the updated tuple, which should not have happened according to the documentation. In Part II of this topic, I will explore the cases when this happens.

On Node Addition in Apache Cassandra

The purpose of this post is not to reiterate but rather to restate how simple adding a node to a running Cassandra cluster is. I will not pretend that this is all that needs to be done once the node has been added. For example, the cluster will start streaming data to the newly added node, which is not always cheap and must be taken care of once the node has joined the cluster. Nevertheless, the procedure itself is pretty straightforward, and the following is the minimum one must do to add a node to a running Cassandra cluster.

Everything below runs on my laptop using Virtual Box VM’s. Here is a cluster of two nodes running on CentOS 7.9:

[centos@centos7 cassandra]$ nodetool status
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.1.107  292.99 KiB  1            100.0%            c9504a89-653b-4e9d-a289-3444fc608a3b  rack1
UN  192.168.1.108  218.35 KiB  1            100.0%            de35416a-256a-445b-ab8d-c6a03da29eee  rack1

The cluster is up and running. What we want to do now is add a third node. Assuming the machine has been prepared, the steps are:
1. Modify the cassandra.yaml file.
2. Start the instance.

That’s it! As simple as that. The minimum that must be changed in the cassandra.yaml file is the seeds and listen_address settings. A seed node is not a special node: it can be any node(s) from the cluster that the joining node contacts in order to obtain cluster information. listen_address is the address that nodes use to communicate with each other.

The new node’s IP address is 192.168.1.109, for which 192.168.1.108 is specified as the seed. Here is how the seeds section looks after the change.

seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Database nodes use this list of hosts to find each other and learn
    # the topology of the ring. You _must_ change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "192.168.1.108"

and here is how I changed the listen_address section

# Address or interface to bind to and tell other nodes to connect to.
# You _must_ change this address or interface to enable multiple nodes to communicate!
#
# Set listen_address OR listen_interface, not both.
#
# When not set (blank), InetAddress.getLocalHost() is used. This
# will always do the Right Thing _if_ the node is properly configured
# (hostname, name resolution, etc), and the Right Thing is to use the
# address associated with the hostname (it might not be).
#
# Setting listen_address to 0.0.0.0 is always wrong.
#
listen_address: 192.168.1.109

Now the Cassandra instance can be started, after which it can be seen that the cluster comprises three nodes:

[centos@centos7 ~]$ nodetool status
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.1.107  306.67 KiB  1            35.7%             c9504a89-653b-4e9d-a289-3444fc608a3b  rack1
UN  192.168.1.108  281.72 KiB  1            85.4%             de35416a-256a-445b-ab8d-c6a03da29eee  rack1
UN  192.168.1.109  326 KiB    1            78.9%             88965b8c-1ba0-4d70-a034-616001795044  rack1

Node addition is completed.

In Favor of Temporary Tablespaces

Imagine a query that faces a PGA shortage and heavily spills to the temporary tablespace to complete (sorting, hashing, etc.). A typical solution is to tune the query, whereas the following can also act as a way out.
The environment is Oracle 19.3 on Linux 6.4. A test table with ~4 million rows is indexed on one of its columns, with the temporary tablespace datafiles residing on disk. Index creation requires sorting, and the volume of data (4 million rows) forces disk spills. In this case, 11904 temporary datafile blocks are allocated and the query completes in 16 seconds.
Then we place the temporary datafiles on tmpfs on Linux, which keeps them in RAM. Oracle is actually agnostic about where the files reside: disk or memory. This time the index creation (started from scratch) spills to 11904 blocks just like before, but the query runs two times faster. OS analysis also confirms that the blocks reside in memory this time.

The World’s Shortest Troubleshooting Scenario

The problem:  A few minute spike on control file sequential read event.
The cause:      A logon storm.
The proof:      A logon trigger querying v$database (that’s a controlfile access). Hence, the wait event.
The solution:  Either remove the query from the trigger, or keep the query and find the reason for the excessive logons in that interval so that they can be eliminated on that side. The latter is the correct way to go.

Tunable PGA analogue in Hadoop

Assume your query needs to do sorting. If the sort memory size (derived from pga_aggregate_target) is not big enough to complete the whole sort in memory, Oracle will spill to disk. The sort will then go through multiple passes to sort the whole data set.

PGA is a private memory area allocated exclusively to a process. Memory chunks in it are categorized as tunable and untunable. The latter is the memory allocated for PL/SQL collections, varchar types, etc. The former can be controlled by adjusting the corresponding parameters. When pga_aggregate_target is set to a non-zero value, Oracle derives certain internal (undocumented) parameters which are then used when making memory allocation decisions for certain operations. For example, the memory area used for hash joins is one of those derived from it; another is the memory used for sorting. This tunable part can be controlled.

Regarding Hadoop: MapReduce works in two phases. The map phase prepares data by converting it into a collection of key-value pairs, which are then consumed by reducers. This phase sorts the data by key before passing it to the reducers. The sort may spill to disk if the sort buffer is not large enough to complete the sort entirely in memory. So, to speed up the job, one may consider increasing the mapreduce.task.io.sort.mb parameter (defaulting to 100 MB) to give the sort extra memory. This is a potential tuning trick if you know that the slowness of the job is indeed related to the map phase spilling to disk multiple times to complete. In that sense, mapreduce.task.io.sort.mb in Hadoop is nothing but _smm_max_size in Oracle.
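The spill-and-merge behavior can be illustrated with a toy external sort in Python (a sketch of the general technique, not Hadoop’s actual implementation):

```python
import heapq

def external_sort(values, buffer_size):
    """Sort values with a bounded in-memory buffer: each time the buffer
    fills, emit a sorted "run" (spill), then merge all runs at the end.
    A bigger buffer_size means fewer spills, like a bigger sort.mb."""
    runs, buf, spills = [], [], 0
    for v in values:
        buf.append(v)
        if len(buf) >= buffer_size:       # buffer full: spill a sorted run
            runs.append(sorted(buf))
            buf, spills = [], spills + 1
    if buf:                               # flush the final partial buffer
        runs.append(sorted(buf))
    return list(heapq.merge(*runs)), spills

print(external_sort([5, 3, 8, 1, 9, 2], 2))  # ([1, 2, 3, 5, 8, 9], 3)
print(external_sort([5, 3, 8, 1, 9, 2], 6))  # ([1, 2, 3, 5, 8, 9], 1)
```

Doubling the buffer from 2 to 6 elements cuts the spill count from 3 to 1, which is exactly the effect increasing the sort memory aims for.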

More Hadoop tricks in the next posts…

Troubleshooting Slowness Associated with Log Activity (again briefly)

One of the databases running on Oracle 12.1 was hanging for some minutes. The top wait event was log file sync. This wait event is posted whenever a user session issues COMMIT or ROLLBACK: a message is sent to LGWR to flush the redo associated with the transaction from the log buffer to the redo log files. Redo storage slowness can also cause a high number of these waits. The ultimate blocker in either case is LGWR.
In most cases (but not always), high commit frequency is the cause and should be resolved (decreased) on the application side. That was not the case this time. The reason was that a single session issued lots of DML statements which generated a huge amount of redo. Once the session issued a commit, it took LGWR quite a long time to flush the redo from the log buffer to disk. That stalled all other sessions: LGWR was busy, so committing sessions were stuck as well. The transaction logic in the application must be changed to issue intermittent commits rather than committing only once at the very end.

Library cache: mutex X and another slowness scenario (briefly)

One of the databases was intermittently going completely unresponsive and did not resume normal operation until the suspect SQL was manually flushed from the shared pool. AWR indicated that the library cache: mutex X wait event prevailed. It is well known that this mutex is taken whenever the corresponding library cache bucket is changed.

So, the first suspect with this wait event is hard parsing, which can be caused by, for example, an undersized shared pool, high version counts, etc. Another cause can be CPU saturation, which prevents the mutex holder from getting back on the CPU to release the mutex. This time there were no hard parses at all, or at least not enough to worry about.

It looks like there are other, non-hard-parse cases that can cause the library cache bucket mutex to be taken exclusively: for example, updating the execution count of a statement, which is held in the parent cursor. Whenever such a statistic needs to be changed, the mutex must be held in exclusive mode. This fits very well with what we observed: a statement executed ~30,000,000 times (and ~28,000,000 times during another occurrence of the issue) within 10 minutes. Once this loop is removed or replaced, the issue will most probably be gone.