Latching and Result Cache – A Quick Troubleshooting Case

One of routine processes in an Oracle 11.2 database in one of the banks used to run just fine (~20 minutes) until a restart of the machine. From then on, the same process completed in around 4 hours. Symptom was a typical one, related to latches and workaround was straightforward, though a little tricky. Yet another thing is that I didn’t save exact figures from that problem so below numbers are made up just from memory.

So, the first step was to see what wait events prevailed at that period. Quite a number of latch free wait events were at the top:

Component	Waits	Time(s)	Avg wait (ms)	% DB time	Wait Class
latch free	235,598	24,906	106	309.52	Other
DB CPU		1,039		12.91
db file sequential read	55,744	418	7	5.19	User I/O
control file sequential read	160,881	166	1	2.07	System I/O
log file sync	114	1	13	0.02	Commit

Recall, when Oracle posts latch free wait event. Once a latch acquisition request is made and latch is not free, then the process spins for some number of times ( 10 _spin_count for exclusive latches) to get the latch. If yet not successful, then the processes sleeps until the lock holder notifies it upon latch release. This applies to all latches of class 0, by default. This behavior can also be changed by modifying _enable_reliable_latch_waits undocumented parameter. When set to false, the process will wake up periodically to spin to check if the latch is free.

So, when spinning is exhausted latch free wait event is published meaning that latch couldn’t be acquired in the intended period. The next was to find out what latch prevailed. Unfortunately, I don’t have AWR report data for this problem exactly, but the numbers were much worse than the following:

Latch Name	Get Requests	Misses	Sleeps	Spin Gets
Result Cache: RC Latch	1,862,968	288,200	273,530	61,310
space background task latch	7,252	2,098	72	2,035
qmn task queue latch	1,010	136	8	128
cache buffers chains	4,119,113	42	6	33
managed standby latch	96	6	1	5
call allocation	3,344	2	1	1
resmgr:active threads	3,572	1	1	0

From here it became apparent that the cause is related to result caching. When this feature is on, result of an SQL query is cached to avoid data retrieval for a similar query. The memory region allocated for the result cache of a query is protected by a latch. Further examination revealed that none of the queries in the system uses this feature via hints; nor this was on at the database level (RESULT_CACHE_MODE = MANUAL). Lots of queries were having RESULT_CACHE in their execution plans. But it shouldn’t be there as per Oracle docs. So, the feature was not supposed to be used for any query in the database. I had to turn off something that is already off, at least to have latches gone. From the other hand, RESULT_CACHE_MAX_SIZE parameter value was non-zero, around 40 MB. So, I’ve decided to zero out this value to disable this. This worked. Latches were gone. The process returned to its past behavior and now it runs for 20 minutes again.

This was not the cause resolution though but rather a workaround. The question that remains is why it was running fine before restart? Also, assuming that result cache should be there as per application requirement, the interesting research would be to find out why it caused that much latching. It could be related to exclusive latches, when underlying queries are run with different variable values, for example. As for the first question, according to Oracle docs, result_cache_max_size is calculated from the values of SHARED_POOL_SIZE, SGA_TARGET and MEMORY_TARGET depending on whether ASMM or MSMM is used. It could, for example, be that another pfile was picked up upon restart that force recalculation of the cache size. Need to analyze it yet…

Leave a Reply Cancel reply