What might be the cause of regular periodic slow transactions on an OLTP system?

oracle oracle-11g-r2 performance rac

I have a two-node Oracle RAC cluster with an OLTP application running against it. I have both a production and a duplicate test environment with the same setup. When running stress tests on the test environment (about 160 tps), the average latency is about 70 ms. When I examined the maximum transaction time within each minute, I noticed a regular spike every 5 minutes in which several transactions take about 1200-1500 ms. I've traced the individual long transactions down to the second, and they seem to last anywhere from 3-5 seconds. There don't seem to be any regularly running jobs that would affect the transactions. In the production environment, we see the same thing, and in fact once in a while we see bigger spikes, with transactions taking 15-20 seconds (15000-20000 ms) to complete. While our production environment is in pilot mode right now and only pushes about 4 tps, we still see these spikes where maximum transaction times reach 1200-1500 ms, and they also seem to occur at 5-minute intervals.
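For anyone who wants to reproduce the per-minute check described above, here is a minimal Python sketch. It assumes the transaction log can be exported as (timestamp, latency in ms) pairs. The function names, the 1000 ms threshold, and the demo data are made up for illustration and are not taken from the real system.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def max_latency_per_minute(samples):
    """Return {minute_start: worst latency in ms} for (timestamp, latency_ms) pairs."""
    per_minute = defaultdict(float)
    for ts, latency_ms in samples:
        minute = ts.replace(second=0, microsecond=0)
        per_minute[minute] = max(per_minute[minute], latency_ms)
    return dict(sorted(per_minute.items()))


def spike_gaps_minutes(per_minute, threshold_ms=1000.0):
    """Return the gaps, in minutes, between consecutive minutes whose max exceeds the threshold."""
    spikes = [m for m, worst in per_minute.items() if worst >= threshold_ms]
    return [int((b - a).total_seconds() // 60) for a, b in zip(spikes, spikes[1:])]


def demo_samples():
    """Synthetic data: ~70 ms transactions plus one 1300 ms outlier every 5th minute."""
    start = datetime(2016, 1, 1, 12, 0, 0)
    samples = []
    for minute in range(15):
        base = start + timedelta(minutes=minute)
        samples.append((base + timedelta(seconds=10), 70.0))
        samples.append((base + timedelta(seconds=40), 65.0))
        if minute % 5 == 0:
            samples.append((base + timedelta(seconds=20), 1300.0))
    return samples


if __name__ == "__main__":
    per_minute = max_latency_per_minute(demo_samples())
    print(spike_gaps_minutes(per_minute))  # prints [5, 5] for the synthetic data
```

If the gaps come out as a near-constant 5, the spikes are tied to something on a timer rather than to load. That narrows the search to anything scheduled on that interval, on the database hosts or off them.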

I have run ASH analytics on these few-second intervals, and there seems to be no common wait event during the spikes. Sometimes we see "log file sync" events, sometimes we see "gc busy" or other "gc" wait events. It varies. I'm trying to figure out how to diagnose the underlying cause of this periodic slowness. Any ideas would be appreciated.
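To compare the spikes side by side rather than one at a time, the ASH samples can be tallied per wait event across all spike windows. This is a rough Python sketch. It assumes the samples have already been exported as (sample_time, event) pairs, with the event empty when the session was on CPU. The function name and the rows below are invented for illustration, not real measurements.

```python
from collections import Counter
from datetime import datetime


def events_in_windows(ash_rows, windows):
    """Count ASH samples per wait event that fall inside any [start, end) spike window."""
    counts = Counter()
    for sample_time, event in ash_rows:
        if any(start <= sample_time < end for start, end in windows):
            # ASH leaves the event empty for sessions that are on CPU.
            counts[event or "ON CPU"] += 1
    return counts


def demo():
    # Two hypothetical spike windows, five minutes apart.
    windows = [
        (datetime(2016, 1, 1, 12, 0, 20), datetime(2016, 1, 1, 12, 0, 25)),
        (datetime(2016, 1, 1, 12, 5, 20), datetime(2016, 1, 1, 12, 5, 25)),
    ]
    rows = [
        (datetime(2016, 1, 1, 12, 0, 21), "log file sync"),
        (datetime(2016, 1, 1, 12, 0, 22), "log file sync"),
        (datetime(2016, 1, 1, 12, 0, 23), None),
        (datetime(2016, 1, 1, 12, 2, 0), "log file sync"),   # outside any window
        (datetime(2016, 1, 1, 12, 5, 21), "gc busy"),
        (datetime(2016, 1, 1, 12, 5, 24), "gc busy"),
    ]
    return events_in_windows(rows, windows)


if __name__ == "__main__":
    for event, samples in demo().most_common():
        print(event, samples)
```

Running the same tally over equally long non-spike windows gives a baseline. An event only matters if it shows up far more often inside the spikes than outside them.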

Best Answer

High Log File Sync waits mean that sessions are waiting at commit for LGWR to write their redo to the online redo logs; in other words, redo write throughput is not high enough. This blog gives a great explanation of how to diagnose and troubleshoot that event:

http://logicalread.solarwinds.com/oracle-log-file-sync-wait-event-dr01/#.V_TSTvkrKUk
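As a quick sanity check, separate from the linked post, the average wait per event can be estimated from two snapshots of the cumulative counters for "log file sync", for example TOTAL_WAITS and TIME_WAITED_MICRO from V$SYSTEM_EVENT. A minimal Python sketch of the arithmetic follows. The function name and the snapshot values are hypothetical.

```python
def avg_wait_ms(before, after):
    """Average wait in ms between two (total_waits, time_waited_micro) snapshots.

    Returns None when no new waits were recorded between the snapshots.
    """
    waits = after[0] - before[0]
    micros = after[1] - before[1]
    if waits <= 0:
        return None
    return micros / waits / 1000.0


if __name__ == "__main__":
    # Hypothetical snapshots taken just before and just after a spike.
    before = (1000, 5_000_000)
    after = (1500, 17_500_000)
    print(avg_wait_ms(before, after))  # prints 25.0
```

Taking the snapshots right around a spike and again during a quiet period shows whether commit latency really jumps on the 5-minute mark or stays flat.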

GC Busy and other GC wait events indicate RAC cluster related waits. We'd need to see a lot more detail to troubleshoot this accurately. That said, if you are having Log File Sync waits on some or all nodes, most likely the other nodes will start showing GC wait events. The nodes have to cooperate with each other. When one node has problems, the whole cluster is affected.