While browsing through the Oracle docs, I was reminded of an interesting troubleshooting case from some time ago, so I'll briefly describe it here. The case involved mutexes and the way the OS interfered with them.
A 24×7 11g R1 database was running on HP-UX. From time to time there were high spikes in the Concurrency wait class, and the top event was library cache: mutex X.
Recall that this event is posted when a session tries to acquire, in exclusive mode, the mutex that protects a bucket in the library cache. The session cannot get the mutex because it is already held in either shared or exclusive mode. Drilling down into the typical causes of this wait event, such as high hard parse rates, didn't reveal anything. The spikes were caused by other sessions repeatedly trying to acquire the mutex, and their spinning burnt CPU. At the database level, there didn't seem to be any reason for a mutex to be held for long periods of time.
Oracle Support provides a script that can be used to find out which session is holding a mutex. The script identified the holder, but nothing that session was doing explained why it would keep the mutex for so long.
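The support script itself is not reproduced here. As a rough illustration of the idea behind it, the sketch below decodes the P2 ("value") parameter of a mutex wait, assuming the commonly described layout in which the upper half of the value carries the SID of the holding session and the lower half a reference count. The function name and the word_bits parameter are made up for this example, and the layout should be checked against the documentation for your own version and platform.

```python
from typing import Tuple

def decode_mutex_value(p2: int, word_bits: int = 64) -> Tuple[int, int]:
    """Split a mutex wait's P2 value into (holder_sid, ref_count).

    Assumption: the upper half of the word holds the SID of the session
    holding the mutex, the lower half holds a reference count.
    """
    half = word_bits // 2
    holder_sid = p2 >> half                # upper half of the word
    ref_count = p2 & ((1 << half) - 1)     # lower half of the word
    return holder_sid, ref_count

# Example with a made-up P2 value taken from a waiting session:
sid, refs = decode_mutex_value(0x0000012300000000)
print(f"holder SID: {sid}, reference count: {refs}")
```

The decoded SID can then be looked up among the database sessions to see what the holder is doing and which OS process it maps to.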
The next step was to suspect the OS. On HP-UX one can use the glance performance monitoring tool to check the resource utilization of processes. A typical output looks as follows (the output is not related to the problem):
The Pri column shows the process priority. Whenever a process is created, it is assigned a priority, which influences how much CPU time the process gets. When a process consumes too much CPU, the OS can lower its priority to be fair to the other processes in the system, so that they don't starve. This is up to the OS scheduler to decide.
What caught my eye in the output at the time was that the priority of the process holding the mutex was very low compared to the other processes. Because of a resource shortage on the machine, the scheduler had lowered the priority of some processes, and one of them happened to be the one holding the mutex. This is known as priority inversion, and quite a number of academic papers have dealt with it. In other words, the session was ready to release the mutex but couldn't, because it couldn't get back on the CPU.
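The effect can be reproduced with a toy model. The sketch below is not HP-UX's scheduler. It assumes a single CPU, a strict "most favoured process runs next" rule, a smaller number meaning a more favoured priority, and an optional aging rule that degrades a process's priority by one step for every tick it runs. All names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Proc:
    name: str
    prio: int           # toy convention: smaller number is scheduled first
    last_run: int = -1  # tick at which the process last had the CPU

def ticks_until_release(holder_prio: int, spinner_prios: List[int],
                        work_ticks: int, aging: bool = True,
                        max_ticks: int = 100_000) -> Optional[int]:
    """Return the tick at which the mutex holder finishes its critical
    section, or None if it never gets enough CPU within max_ticks."""
    holder = Proc("holder", holder_prio)
    procs = [holder] + [Proc(f"spinner{i}", p)
                        for i, p in enumerate(spinner_prios)]
    remaining = work_ticks
    for tick in range(max_ticks):
        # Most favoured process wins; ties go to whoever ran least recently.
        running = min(procs, key=lambda p: (p.prio, p.last_run))
        running.last_run = tick
        if running is holder:
            remaining -= 1
            if remaining == 0:
                return tick + 1  # mutex released, spinners can proceed
        if aging:
            # Burning CPU degrades priority, whether spinning or working.
            running.prio = min(running.prio + 1, 255)
    return None

# Holder already demoted, three spinners at a favoured priority:
aged = ticks_until_release(240, [180, 180, 180], work_ticks=3)
# Same priorities, no aging: the spinners monopolise the CPU.
starved = ticks_until_release(240, [180, 180, 180], work_ticks=3,
                              aging=False, max_ticks=1_000)
# Everyone pinned at the same priority, no aging: plain round robin.
fair = ticks_until_release(178, [178, 178, 178], work_ticks=3, aging=False)
print(aged, starved, fair)
```

In this model the demoted holder needs only three ticks of CPU, yet it has to wait until the spinners have burnt enough CPU to be demoted to its level. When all processes share one fixed priority, the holder gets its turn almost immediately and the spinners stop wasting cycles.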
Oracle provides the HPUX_SCHED_NOAGE parameter, with a default value of 178, for setting the priority of Oracle processes under the SCHED_NOAGE scheduling policy, which exempts them from priority aging. After this parameter was adjusted, the problem was gone. Further observation confirmed that the Oracle processes were now assigned higher priorities, which allowed the mutex holder to release the mutex as soon as it was done with it.