Intelligent Database Management Systems

Machine Learning (ML) has been around for decades. Lots of applications of it can be provided. The idea here is to feed the agent with the ability to learn by itself.  We don’t program all possible if-then-else cases but rather provide samples and the entity learns how to behave from this point on (supervised learning), or drop the entity in to an environment and then punish or reward it based on its behavior and it becomes smarter as it hits its head against the wall oftener and oftener (reinforcement leaning), or lastly given collection of data, the agent thinks about properly classifying it or splitting them into possible groups (unsupervised learning) that share similar properties. Take a look at this to see how a robot learns to flip pancakes after 50 trials. It does impress.

Database Management Systems (DBMS) have some candidate modules for integrating with ML. Query Optimizer can serve as an example of it. A query may have multiple plans for execution. The job of the optimizer is to pick up a plan with the least cost. Cost may include many metrics, such as number of I/O or number of rows processed (for In-Memory DBMS). We can treat the optimizer as agent/robot. And the system as the environment. Cost can serve as reward. The lesser the cost, the more we reward the optimizer. So, even though this time a particular cost was picked up, the optimizer will try to improve it the next time. In this way, the optimizer can learn from its behavior itself rather than being programmed for most of possible cases.

Harder to do than said. In theory, this would work. In practice….it needs a careful design and lots of experimentation. The hardest part would be to map optimizer’s behavior onto reinforcement learning paradigm. Environments are so many and very different.

Computer science associate professor in MIT Tim Kraska briefly speaks about how DBAs can be released from index management in databases if machine learning used. Once data distribution changes, it could be worth to reconsider indexes. Index management module can learn to pick up columns to be indexed as data in the database changes over time.

Recently, group of researchers and students at Carnegie Mellon University have developed a tool named OtterTune that aims at simplifying DBMS deployment. The tool has a repository that keeps data from historical tuning sessions. This experience is used then by the ML engine to develop new, appropriate recommendations for new configuration depending on application objective. In the corresponding paper, authors conduct experiments on three DBMSs: MySQL, Postgres and Vector.

Oracle introduced some adaptive features in 12c which is a step towards intelligent tuning. One of them is adaptive execution plan feature which, briefly put, works as follows. An SQL statement is executed with an initially generated plan and candidate plans are kept in the cursor of the query. Say, for a join in the query Nested Loop (NL) is the initial one. So, when the execution reaches this join point, rows are buffered and compared with initial estimates. If comparison reaches certain threshold, then the NL is replaced with Hash Join which would be a smarter decision for this case. Thus, a join method switch happens on the fly, at run time. Execution considers feedback. Subsequent executions of the same SQL query then start using this plan from then on.

Database Management Systems leave a large room for application of machine learning methods. Candidates can be query optimization, database recovery and checkpointing, memory management and instance tuning. Along with database knowledge, ML skills for future DBAs might be a plus.

Latching and Result Cache – A Quick Troubleshooting Case

One of routine processes in an Oracle 11.2 database in one of the banks used to run just fine (~20 minutes) until a restart of the machine. From then on, the same process completed in around 4 hours. Symptom was a typical one, related to latches and workaround was straightforward, though a little tricky.  Yet another thing is that I didn’t save exact figures from that problem so below numbers are made up just from memory.

So, the first step was to see what wait events prevailed at that period. Quite a number of latch free wait events were at the top:

 ComponentWaitsTime(s)Avg wait (ms)% DB timeWait Class
latch free235,59824,906106309.52Other
DB CPU 1,039 12.91 
db file sequential read55,74441875.19User I/O
control file sequential read160,88116612.07System I/O
log file sync1141130.02Commit

Recall, when Oracle posts latch free wait event. Once a latch acquisition request is made and latch is not free, then the process spins for some number of times ( 10 _spin_count for exclusive latches) to get the latch. If yet not successful, then the processes sleeps until the lock holder notifies it upon latch release. This applies to all latches of class 0, by default. This behavior can also be changed by modifying _enable_reliable_latch_waits undocumented parameter. When set to false, the process will wake up periodically to spin to check if the latch is free.

So, when spinning is exhausted latch free wait event is published meaning that latch couldn’t be acquired in the intended period. The next was to find out what latch prevailed. Unfortunately, I don’t have AWR report data for this problem exactly, but the numbers were much worse than the following:

Latch NameGet RequestsMissesSleepsSpin Gets
Result Cache: RC Latch1,862,968288,200273,53061,310
space background task latch7,2522,098722,035
qmn task queue latch1,0101368128
cache buffers chains4,119,11342633
managed standby latch96615
call allocation3,344211
resmgr:active threads3,572110

From here it became apparent that the cause is related to result caching. When this feature is on, result of an SQL query is cached to avoid data retrieval for a similar query. The memory region allocated for the result cache of a query is protected by a latch. Further examination revealed that none of the queries in the system uses this feature via hints; nor this was on at the database level (RESULT_CACHE_MODE = MANUAL). Lots of queries were having RESULT_CACHE in their execution plans. But it shouldn’t be there as per Oracle docs. So, the feature was not supposed to be used for any query in the database. I had to turn off something that is already off, at least to have latches gone. From the other hand, RESULT_CACHE_MAX_SIZE parameter value was non-zero, around 40 MB. So, I’ve decided to zero out this value to disable this. This worked. Latches were gone. The process returned to its past behavior and now it runs for 20 minutes again.

This was not the cause resolution though but rather a workaround. The question that remains is why it was running fine before restart? Also, assuming that result cache should be there as per application requirement, the interesting research would be to find out why it caused that much latching. It could be related to exclusive latches, when underlying queries are run with different variable values, for example. As for the first question, according to Oracle docs, result_cache_max_size is calculated from the values of SHARED_POOL_SIZE, SGA_TARGET and MEMORY_TARGET depending on whether ASMM or MSMM is used. It could, for example, be that another pfile was picked up upon restart that force recalculation of the cache size. Need to analyze it yet…