One of routine processes in an Oracle 11.2 database in one of the banks used to run just fine (~20 minutes) until a restart of the machine. From then on, the same process completed in around 4 hours. Symptom was a typical one, related to latches and workaround was straightforward, though a little tricky. Yet another thing is that I didn’t save exact figures from that problem so below numbers are made up just from memory.
So, the first step was to see what wait events prevailed at that period. Quite a number of latch free wait events were at the top:
|Component||Waits||Time(s)||Avg wait (ms)||% DB time||Wait Class|
|db file sequential read||55,744||418||7||5.19||User I/O|
|control file sequential read||160,881||166||1||2.07||System I/O|
|log file sync||114||1||13||0.02||Commit|
Recall, when Oracle posts latch free wait event. Once a latch acquisition request is made and latch is not free, then the process spins for some number of times ( 10 _spin_count for exclusive latches) to get the latch. If yet not successful, then the processes sleeps until the lock holder notifies it upon latch release. This applies to all latches of class 0, by default. This behavior can also be changed by modifying _enable_reliable_latch_waits undocumented parameter. When set to false, the process will wake up periodically to spin to check if the latch is free.
So, when spinning is exhausted latch free wait event is published meaning that latch couldn’t be acquired in the intended period. The next was to find out what latch prevailed. Unfortunately, I don’t have AWR report data for this problem exactly, but the numbers were much worse than the following:
|Latch Name||Get Requests||Misses||Sleeps||Spin Gets|
|Result Cache: RC Latch||1,862,968||288,200||273,530||61,310|
|space background task latch||7,252||2,098||72||2,035|
|qmn task queue latch||1,010||136||8||128|
|cache buffers chains||4,119,113||42||6||33|
|managed standby latch||96||6||1||5|
From here it became apparent that the cause is related to result caching. When this feature is on, result of an SQL query is cached to avoid data retrieval for a similar query. The memory region allocated for the result cache of a query is protected by a latch. Further examination revealed that none of the queries in the system uses this feature via hints; nor this was on at the database level (RESULT_CACHE_MODE = MANUAL). Lots of queries were having RESULT_CACHE in their execution plans. But it shouldn’t be there as per Oracle docs. So, the feature was not supposed to be used for any query in the database. I had to turn off something that is already off, at least to have latches gone. From the other hand, RESULT_CACHE_MAX_SIZE parameter value was non-zero, around 40 MB. So, I’ve decided to zero out this value to disable this. This worked. Latches were gone. The process returned to its past behavior and now it runs for 20 minutes again.
This was not the cause resolution though but rather a workaround. The question that remains is why it was running fine before restart? Also, assuming that result cache should be there as per application requirement, the interesting research would be to find out why it caused that much latching. It could be related to exclusive latches, when underlying queries are run with different variable values, for example. As for the first question, according to Oracle docs, result_cache_max_size is calculated from the values of SHARED_POOL_SIZE, SGA_TARGET and MEMORY_TARGET depending on whether ASMM or MSMM is used. It could, for example, be that another pfile was picked up upon restart that force recalculation of the cache size. Need to analyze it yet…