Result Cache Malfunction in Oracle 21c

One of these days I was configuring the result cache in an Oracle 21c database and encountered unusual behavior. Recall that the result cache, a slice of memory in the shared pool, was introduced in Oracle 11g. It stores query results so that subsequent executions can reuse them.

It looks like it does not work as documented in Oracle 21c. Two modes are available to configure it: manual, by hinting the query, and force, which makes all queries store their results in the result cache (and which, by the way, is not the recommended way). I was experimenting on Ubuntu 8. Here is the scenario (note that the result_cache_max_size parameter, which enables or disables the result cache, is set to 1792K).

First, let’s create a small table and fill it with some data.

create table table1(id integer, name varchar2(30));
insert into table1 values (1, 'Name1');
insert into table1 values (2, 'Name2');
insert into table1 values (3, 'Name3');

The query is as simple as the following one:

select /*+ RESULT_CACHE */ * from table1;

Plan hash value: 963482612

| Id  | Operation         | Name   | E-Rows |
|   0 | SELECT STATEMENT  |        |        |
|   1 |  TABLE ACCESS FULL| TABLE1 |      3 |

Just to make sure that the query will use the previously cached results, we can execute it multiple times. Still, no effect: the plan remains the same, with no result cache operation in it.
Tracing the optimizer reveals some details (though it is not yet clear why):

Dumping Hints
  atom_hint=(@=0x7f25d42835a8 err=38 resol=1 used=1 token=1192 org=1 lvl=2 txt=RESULT_CACHE ())
********** WARNING: SOME HINTS HAVE ERRORS *********
====================== END SQL Statement Dump ======================

That is, the optimizer treats RESULT_CACHE as a hint and errors out, explicitly telling us that it cannot parse it: err=38. I tried the FULL hint for comparison, and for that one err=0. It is easy to see that this is not a configuration-related problem (for example, a minimum data size to be eligible for caching, etc.), since the optimizer reports an outright error. Moreover, just to exclude size-related limitations, if any, I tried with a large data set – no effect.
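To cross-check from the SQL side, the result cache can be inspected directly; a minimal sketch, assuming privileges on the DBMS_RESULT_CACHE package and V$RESULT_CACHE_OBJECTS:

```sql
-- Is the result cache enabled in this instance at all?
SELECT DBMS_RESULT_CACHE.STATUS() FROM dual;

-- Any cached result sets referencing our table?
-- (expected: none, given the hint error above)
SELECT type, status, name
FROM   v$result_cache_objects
WHERE  name LIKE '%TABLE1%';
```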

To be continued…

FILLFACTOR parameter in PostgreSQL (Part II – source code hacking)

In Part I of this series I investigated the effect of the fillfactor parameter in Postgres 14.1. It turned out that the database does not always store the updated version of a tuple on a new page, whereas it should, at least with the default value of fillfactor=100.

Now, I will investigate the source code to see what prevents the database from doing so.

This time I used a table with fewer columns to exclude The Oversized-Attribute Storage Technique (TOAST). This is how PostgreSQL manages to store large field values, since it does not support spanning a tuple across multiple pages.

create table fillfactortable(
      col1 varchar(30),col2 varchar(30),col3 varchar(30),col4 varchar(30),col5 varchar(30),col6 varchar(30)
);

insert into fillfactortable (
    col1, col2, col3,col4,col5,col6)
select
    left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10)
from generate_series(1, 1) s(i);

The size of the table is 8KB and fillfactor has its default value of 100.

select pg_size_pretty(pg_total_relation_size('fillfactortable'));
 8192 bytes
(1 row)

To find out how a tuple is updated, let's take a look at the source code. The routine that does it starts at line 3155 in src/backend/access/heap/heapam.c:

heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
			CommandId cid, Snapshot crosscheck, bool wait,
			TM_FailureData *tmfd, LockTupleMode *lockmode)

Here, the following function is called to do the actual update (line 3926).

RelationPutHeapTuple(relation, newbuf, heaptup, false);

The newbuf parameter's value decides whether the updated version of the tuple will be on the same page or not. The decision is based on the following condition (line 3670):

pagefree = PageGetHeapFreeSpace(page);
newtupsize = MAXALIGN(newtup->t_len);
if (need_toast || newtupsize > pagefree)
{
	/* TOAST work needed and/or the new tuple will not fit on the same page */
	...
}
else
{
	/* No TOAST work needed, and it'll fit on same page */
	newbuf = buffer;
	heaptup = newtup;
}

In my scenario need_toast is false, since the table has just six attributes of varchar(30). Moreover, were there another page for TOAST, the pg_total_relation_size function would return 16KB, since an additional page would have been allocated. This can also be confirmed by querying pg_class with the corresponding value for the oid column. That is, newtupsize > pagefree evaluating to false is the only case when the new version of the tuple is placed on the same page, since only then is newbuf = buffer. This is exactly what happens here, because pagefree receives a value greater than newtupsize.

To investigate further, let's see how PageGetHeapFreeSpace works. It is essentially a wrapper around PageGetFreeSpace, which can be found at line 899 in src/backend/storage/page/bufpage.c:

Size
PageGetFreeSpace(Page page)
{
	int			space;

	/*
	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
	 * pd_upper.
	 */
	space = (int) ((PageHeader) page)->pd_upper -
		(int) ((PageHeader) page)->pd_lower;

	if (space < (int) sizeof(ItemIdData))
		return 0;
	space -= sizeof(ItemIdData);

	return (Size) space;
}

Here, pd_upper and pd_lower are the offsets of the end and the start of free space in a page, respectively. By computing pd_upper - pd_lower, the function calculates the space available between the end of the line pointers and the start of the tuple data. The condition if (space < (int) sizeof(ItemIdData)) then checks whether any space is left after reserving room for one more line pointer.

Let's see what the values of pd_upper and pd_lower are in the table (heads up! No fillfactor involvement so far!):

SELECT lower, upper FROM page_header(get_raw_page('fillfactortable', 0));
 lower | upper
-------+-------
    28 |  8096
(1 row)

That is, the page has 8096-28=8068 bytes free. From this, the space taken by the single line pointer (4 bytes) is subtracted as well, which gives 8064 for the pagefree variable – far greater than the tuple size, which is itself only a few tens of bytes.
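The arithmetic above can be sketched in a few lines of Python; ITEM_ID_DATA_SIZE is an assumption of 4 bytes for sizeof(ItemIdData), matching the post:

```python
ITEM_ID_DATA_SIZE = 4  # assumed sizeof(ItemIdData): one line pointer


def page_get_free_space(pd_lower: int, pd_upper: int) -> int:
    """Mirror of PageGetFreeSpace's arithmetic (sketch)."""
    space = pd_upper - pd_lower       # gap between line pointers and tuple data
    if space < ITEM_ID_DATA_SIZE:     # not even room for one more line pointer
        return 0
    return space - ITEM_ID_DATA_SIZE  # reserve room for the next line pointer


# values observed via page_header() above
print(page_get_free_space(28, 8096))  # 8064
```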

A table with fillfactor=100 places the new version of an updated tuple on the same page, which should not be the case according to the documentation. The tuple update routine uses the PageGetFreeSpace function to decide whether the new version of the updated tuple stays on the same page. Neither the caller nor the routine itself makes use of the fillfactor value. The routine calculates free space based on pd_upper, pd_lower, and the line pointers. In this scenario, pagefree should have been zero, in which case the database would have allocated an additional page and placed the new version of the tuple there. As demonstrated, this did not happen even though fillfactor was set to 100. In Part III of this post series I will investigate cases when it does occur.

FILLFACTOR parameter in PostgreSQL (Part I)

The purpose of FILLFACTOR in PostgreSQL is widely known. According to the documentation, the default value of 100 means complete packing. That is, the new version of a tuple will be placed on a different page. Here is a simple experiment that demonstrates that this is not always the case.

Here is the experimental table:

create table fillfactortable(
      col1 varchar(30),col2 varchar(30),col3 varchar(30),col4 varchar(30),col5 varchar(30),col6 varchar(30),
      col7 varchar(30),col8 varchar(30),col9 varchar(30),col10 varchar(30),col11 varchar(30),col12 varchar(30),
      col13 varchar(30),col14 varchar(30),col15 varchar(30),col16 varchar(30),col17 varchar(30),col18 varchar(30),
      col19 varchar(30),col20 varchar(30)
      ) ;

As seen, the FILLFACTOR is not explicitly specified, i.e. the default value of 100 is used. Let's insert a single row

insert into fillfactortable (
    col1, col2, col3,col4,col5,col6, col7,col8,col9,col10,col11,col12,
    col13,col14,col15,col16,col17,col18, col19,col20)
select
    left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),	left(md5(i::text), 10),
	left(md5(i::text), 10),	left(md5(i::text), 10)
from generate_series(1, 1) s(i);

followed by checking the number of pages in the table.

select pg_size_pretty (pg_relation_size('fillfactortable'));
 8192 bytes
(1 row)

The table has only one page. Now, let’s update the tuple.

update fillfactortable set col1 = 'random';

This results in a dead tuple which can be confirmed via the following query:

select n_live_tup, n_dead_tup, relname from pg_stat_all_tables where relname = 'fillfactortable';
 n_live_tup | n_dead_tup |     relname
------------+------------+-----------------
          1 |          1 | fillfactortable
(1 row)
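The two tuple versions can also be seen directly on the page with the pageinspect extension; a sketch, assuming the extension is available:

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- line pointers and tuple headers on page 0; after the update,
-- both the dead and the live version should appear here
SELECT lp, t_xmin, t_xmax, t_ctid
FROM   heap_page_items(get_raw_page('fillfactortable', 0));
```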

Now we can check the table size again:

select pg_size_pretty (pg_relation_size('fillfactortable'));
 8192 bytes
(1 row)

That is, the table still has only one page, i.e. no page has been allocated to store the other version of the updated tuple, which should not have happened according to the documentation. In Part II of this topic, I will explore the cases when this happens.

On Node Addition in Apache Cassandra

The purpose of this post is not to reiterate but rather to restate how simple adding a node to a running Cassandra cluster is. I will not pretend that this is all that needs to be done once the node has been added. For example, the cluster will start streaming data to the newly added node, which is not always cheap and must be taken care of once the node has joined the cluster. Nevertheless, the procedure itself is pretty straightforward, and the following is the minimum one must do to add a node to a running Cassandra cluster.

Everything below runs on my laptop using VirtualBox VMs. Here is a cluster of two nodes running on CentOS 7.9:

[centos@centos7 cassandra]$ nodetool status
Datacenter: Cassandra
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  292.99 KiB  1            100.0%            c9504a89-653b-4e9d-a289-3444fc608a3b  rack1
UN  218.35 KiB  1            100.0%            de35416a-256a-445b-ab8d-c6a03da29eee  rack1

The cluster is up and running. What we want to do now is to add the third node. Assuming the machine has been prepared the steps are:
1. Modify the cassandra.yaml file.
2. Start the instance.

That's it! As simple as that. The minimum that must be changed in the cassandra.yaml file is the seeds and listen_address settings. A seed node is not a special node; it can be any node(s) in the cluster that the joining node contacts in order to obtain cluster information. listen_address is the address that nodes use to communicate with each other.

The seed is specified as an IP address. Here is how the seeds section looks after the change.

seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Database nodes use this list of hosts to find each other and learn
    # the topology of the ring. You _must_ change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: ""

and here is how I changed the listen_address section

# Address or interface to bind to and tell other nodes to connect to.
# You _must_ change this address or interface to enable multiple nodes to communicate!
# Set listen_address OR listen_interface, not both.
# When not set (blank), InetAddress.getLocalHost() is used. This
# will always do the Right Thing _if_ the node is properly configured
# (hostname, name resolution, etc), and the Right Thing is to use the
# address associated with the hostname (it might not be).
# Setting listen_address to is always wrong.
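Putting both settings together, a minimal edit could look like the following; the addresses are placeholders for illustration, not the ones used in this post:

```yaml
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # any reachable node(s) of the cluster will do as seeds
          - seeds: "192.0.2.11"

# the joining node's own address
listen_address: 192.0.2.13
```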

Now the Cassandra instance can be started, after which it can be seen that the cluster comprises three nodes:

[centos@centos7 ~]$ nodetool status
Datacenter: Cassandra
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  306.67 KiB  1            35.7%             c9504a89-653b-4e9d-a289-3444fc608a3b  rack1
UN  281.72 KiB  1            85.4%             de35416a-256a-445b-ab8d-c6a03da29eee  rack1
UN  326 KiB    1            78.9%             88965b8c-1ba0-4d70-a034-616001795044  rack1

Node addition is complete.

In Favor of Temporary Tablespaces

Imagine a query that faces a PGA shortage and heavily spills to the temporary tablespace to complete (sorting, hashing, etc.). A typical solution is to tune the query, but the following can also act as a way out.
The environment is Oracle 19.3 on Linux 6.4. A test table with ~4 million rows is indexed on one of the columns, with the temporary tablespace datafiles residing on disk. Index creation requires sorting, and the volume of data (4 million rows) forces disk spills. In this case, 11904 temporary datafile blocks were allocated and the query completed in 16 seconds.
Then, we place the temporary datafiles on a tmpfs mount on Linux, which keeps them in RAM. Oracle is actually agnostic to where the files reside: disk or memory. This time the index creation (started from scratch) spills to 11904 blocks just like before, but the query runs twice as fast. OS analysis also confirms that the blocks reside in memory this time.
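As a sketch of the idea (paths and sizes below are illustrative, and tmpfs contents do not survive a reboot, so this is not a general recommendation):

```sql
-- after mounting tmpfs on the OS side, e.g.:
--   mount -t tmpfs -o size=4g tmpfs /mnt/oratmp
ALTER TABLESPACE temp ADD TEMPFILE '/mnt/oratmp/temp01.dbf' SIZE 2G;
-- the original disk-resident tempfile can then be dropped
```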

The World’s Shortest Troubleshooting Scenario

The problem:  A few-minute spike on the control file sequential read event.
The cause:      A logon storm.
The proof:      A logon trigger querying v$database (that's a controlfile access). Hence, the wait event.
The solution:  Either remove the query from the trigger, or keep the query and find out the reason for the excessive logons in that interval so that the storm is eliminated on that side. The latter is the correct way to go.

Tunable PGA analogue in Hadoop

Assume your query needs to do sorting. If the sort memory size (derived from pga_aggregate_target) is not big enough to complete the whole sort in memory, Oracle will spill to disk. The sorting process will go through multiple passes to sort the whole data set.

PGA is a private memory area allocated exclusively to a process. Memory chunks in it are categorized into tunable and untunable. The latter is the memory allocated for PL/SQL collections, varchar types, etc. The former can be controlled by adjusting the corresponding parameters. When pga_aggregate_target is set to a non-zero value, Oracle adjusts certain internal (undocumented) parameters, which are then used for making decisions when allocating memory to certain operations. For example, the memory area used for hash joins is one of those derived from it. Another one is the memory used for sorting.

Regarding Hadoop… MapReduce works in two phases. The map phase prepares data by converting it into a collection of key-value pairs, which are then consumed by reducers. This phase sorts data by key before passing it to the reducers. Sorting may spill to disk if the sort buffer is not large enough to complete the sort entirely in memory. So, in order to speed up the job, one may want to consider increasing the sort buffer parameter (defaulting to 100 MB) to give extra memory for the sort. This is a potential tuning trick if you know that the slowness of the job is indeed related to the map phase spilling to disk multiple times. That said, this parameter in Hadoop is nothing but an analogue of _smm_max_size in Oracle.
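The multi-pass spill-and-merge idea can be illustrated with a toy external sort in Python; this is a sketch of the general technique, not Hadoop's actual implementation (the runs are kept in memory here, whereas a real sorter would write them to spill files):

```python
import heapq


def external_sort(items, buffer_size):
    """Sort a stream using a bounded buffer: fill the buffer, emit a
    sorted run (a "spill"), then merge all runs in a final pass."""
    runs = []
    buf = []
    for x in items:
        buf.append(x)
        if len(buf) >= buffer_size:   # buffer full: spill a sorted run
            runs.append(sorted(buf))
            buf = []
    if buf:
        runs.append(sorted(buf))      # final partial run
    # merge pass over all sorted runs (cf. merging spill files in MapReduce)
    return list(heapq.merge(*runs))


print(external_sort([5, 3, 1, 4, 2], 2))  # [1, 2, 3, 4, 5]
```

A bigger buffer_size means fewer runs, and hence less merge work – which is exactly the effect of enlarging the map-side sort buffer.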

More Hadoop tricks in the next posts…

Troubleshooting Slowness Associated with Log Activity (again briefly)

One of the databases running on Oracle 12.1 was hanging for some minutes. The top wait event was log file sync. This wait event is posted whenever a user session issues COMMIT or ROLLBACK: a message is sent to LGWR to flush the redo associated with the transaction from the log buffer to the redo log files. Redo storage slowness can also cause a high number of these waits. The ultimate blocker in either case is LGWR.
In most cases (but not always), high commit frequency is the cause and should be resolved (decreased) on the application side. That was not the case this time. The reason was that a single session issued lots of DML statements, which generated a huge amount of redo. Once the session issued a commit, it took LGWR quite a long time to flush the redo from the log buffer to disk. That stalled all other sessions, because LGWR was busy and committing sessions were stuck as well. The transaction logic must be changed in the application to issue intermittent commits rather than committing only once at the end.
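A session generating disproportionate redo can be spotted by ranking sessions on the 'redo size' statistic; a sketch, assuming access to the V$ views (FETCH FIRST requires 12c or later):

```sql
SELECT   s.sid, st.value AS redo_bytes
FROM     v$sesstat st
JOIN     v$statname sn ON sn.statistic# = st.statistic#
JOIN     v$session  s  ON s.sid = st.sid
WHERE    sn.name = 'redo size'
ORDER BY st.value DESC
FETCH FIRST 10 ROWS ONLY;
```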

Library cache: mutex X and another slowness scenario (briefly)

One of the databases would eventually become completely unresponsive and would not resume normal execution until the suspect SQL was manually flushed from the shared pool. AWR indicates that the library cache: mutex X wait event prevails. It is well known that this mutex is taken whenever the corresponding library cache bucket is changed.

So, the first suspect with this wait event is hard parsing, which can be caused by, for example, an undersized shared pool, high version counts, etc. Another cause can be CPU saturation, which prevents a thread from getting on the CPU so that the necessary mutex can be released. This time there were no hard parses at all – at least, not enough to worry about.

It looks like there are other cases (not hard parse related) that can cause the library cache bucket mutex to be taken exclusively – for example, updating the execution count of a statement, which is held in the parent cursor. Whenever a statistic needs to be changed, the mutex must be held in exclusive mode. This fits very well with what we observed: a statement executed ~30,000,000 times, and during another occurrence of the issue ~28,000,000 times, within 10 minutes. Once this loop is removed or replaced, the issue will most probably be gone.
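The heavy hitters by execution count are easy to list from the shared pool; a sketch (FETCH FIRST requires 12c or later):

```sql
SELECT   sql_id, executions, version_count
FROM     v$sqlarea
ORDER BY executions DESC
FETCH FIRST 10 ROWS ONLY;
```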

DBWR and the Shared Pool – Part II

The buffer cache in the SGA holds data blocks that a server process reads from disk storage. This is well known. Access to these data blocks in memory is done via buffer headers that store data block metadata, such as the touch count, dirty flag, etc. For example, when a server process needs to access a particular data block in memory, the DBA (data block address) is hashed and the bucket where that buffer header would be is returned. Then the cache buffers chains latch that protects this bucket from simultaneous access is acquired. Note that this latch is set not on the buffers themselves but on their headers – on the array of headers, to be more precise. The latter contains pointers to the actual data buffers. So, once you land on a header in the chain, you follow the pointer to the actual buffer frame.

DBWR also eventually accesses the buffer header array, for example during a checkpoint: it needs to know which buffers are dirty and which are not. The array of buffer headers is externalized in x$bh, which comes in handy when troubleshooting, for example, latch contention. It contains lots of other useful information, and it actually tells you that the buffer header array is implemented as a doubly linked list (the NXT_HASH and PRV_HASH columns).
According to some sources this structure is stored in the buffer cache itself; other sources state that it lies in the shared pool.
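As a sketch (X$ tables are undocumented, require connecting as SYS, and their columns vary by version; NXT_HASH and PRV_HASH are the column names mentioned above):

```sql
-- connect as SYS
SELECT dbarfil, dbablk, tch, nxt_hash, prv_hash
FROM   x$bh
WHERE  ROWNUM <= 5;
```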

Now, back to the prior post, named DBWR and the Shared Pool, where I wondered why DBWR should access the shared pool. If the buffer header array is indeed stored in the shared pool and not the buffer cache, then that makes it clear. It remains to prove it, if so.

In Part III, I’ll come up with quick demos to show whether the buffer header chain is indeed in the shared pool or not. Stay tuned…