What might be the cause of regular periodic slow transactions on an OLTP system

oracleoracle-11g-r2performancerac

I have a two node Oracle RAC cluster with that runs an oltp application against it. I have both a production and a duplicate test environment with the same setup. Upon running stress tests on the Test environment (about 160 tps), The average latency is about 70ms. I noticed that when I examined the maximum transaction time within each minute, that there is a regular spike in the maximum transaction time every 5 minutes of several transactions that take about 1200-1500 ms. I've traced the individual long transactions down to the second and them seem to last for anywhere from 3-5 seconds. There don't seem to be any regular running jobs that would effect the transactions. In environment, we see the same thing and in fact once in a while, we see bigger spikes of transactions taking 15-20 seconds (15000-20000 ms) to complete. While our production environment is in a pilot mode right now, and only pushes about 4 tps, we do still see these spikes where maximum transaction lengths reach 1200-1500 ms, and they also seem to occur on 5 minute intervals.

I have performed ASH analytics on the small few second intervals, and it seems that there is no common wait event that occurs during these spikes. Sometimes we see "log file sync" events, sometimes we see "gc busy" or other "gc" wait events. It seems to vary. I'm just trying to figure out how to diagnose the underlying cause of this periodic slowness. Any ideas would be appreciated.

Best Answer

High Log File Sync wait events mean that throughput on writing to the online redo logs is not high enough. This blog gives a great explanation of how to diagnose and troubleshoot that event:

http://logicalread.solarwinds.com/oracle-log-file-sync-wait-event-dr01/#.V_TSTvkrKUk

GC Busy and other GC wait events indicate RAC cluster related waits. We'd need to see a lot more detail to troubleshoot this accurately. That said, if you are having Log File Sync waits on some or all nodes, most likely the other nodes will start showing GC wait events. The nodes have to cooperate with each other. When one node has problems, the whole cluster is affected.

Related Solutions

Sql-server – SQL Server – Multiple running totals

Asynchronous implies that the running totals do not need to be completely accurate at all times, or your data change patterns are such that a one-off running total build will be valid and accurate until the next load. Anyway, I'm sure you have thought that part through, so I won't labour the point.

Your main options for a high-performance, supported, method are a SQLCLR function/procedure, or an UPDATE based on Hugo Kornelis' set-based iteration method. The SQLCLR method (implemented in a procedure, but reasonably easy to translate) can be found here.

I haven't been able to find Hugo's method online, but it is detailed in the excellent MVP Deep Dives (Volume 1). Sample code to illustrate Hugo's method (copied from one of my posts on another site you may not have a login for) is shown below:

-- A work table to hold the reformatted data, and
-- ultimately, the results
CREATE  TABLE #Work
    (
    Acct_No         VARCHAR(20) NOT NULL,
    MonthDate       DATETIME NOT NULL,
    MonthRate       DECIMAL(19,12) NOT NULL,
    Amount          DECIMAL(19,12) NOT NULL,
    InterestAmount  DECIMAL(19,12) NOT NULL,
    RunningTotal    DECIMAL(19,12) NOT NULL,
    RowRank         BIGINT NOT NULL
    );

-- Prepare the set-based iteration method
WITH    Accounts
AS      (
        -- Get a list of the account numbers
        SELECT  DISTINCT Acct_No 
        FROM    #Refunds
        ),
        Rates
AS      (
        -- Apply all the accounts to all the rates
        SELECT  A.Acct_No,
                R.[Year],
                R.[Month],
                MonthRate = R.InterestRate / 12
        FROM    #InterestRates R
        CROSS 
        JOIN    Accounts A
        ),
        BaseData
AS      (
        -- The basic data we need to work with
        SELECT  Acct_No = ISNULL(R.Acct_No,''),
                MonthDate = ISNULL(DATEADD(MONTH, R.[Month], DATEADD(YEAR, R.[year] - 1900, 0)), 0),
                R.MonthRate,
                Amount = ISNULL(RF.Amount,0),
                InterestAmount = ISNULL(RF.Amount,0) * R.MonthRate,
                RunningTotal = ISNULL(RF.Amount,0)
        FROM    Rates R
        LEFT
        JOIN    #Refunds RF
                ON  RF.Acct_No = R.Acct_No
                AND RF.[Year] = R.[Year]
                AND RF.[Month] = R.[Month]
        )
-- Basic data plus a rank id, numbering the rows by MonthDate, and resetting to 1 for each new Account
INSERT  #Work
        (Acct_No, MonthDate, MonthRate, Amount, InterestAmount, RunningTotal, RowRank)
SELECT  BD.Acct_No, BD.MonthDate, BD.MonthRate, BD.Amount, BD.InterestAmount, BD.RunningTotal,
        RowRank = RANK() OVER (PARTITION BY BD.Acct_No ORDER BY MonthDate)
FROM    BaseData BD;

-- An index to speed the next stage (different from that used with the Quirky Update method)
CREATE UNIQUE CLUSTERED INDEX nc1 ON #Work (RowRank, Acct_No);

-- Iteration variables
DECLARE @Rank       BIGINT,
        @RowCount   INTEGER;

-- Initialize
SELECT  @Rank = 1,
        @RowCount = 1;

-- This is the iteration bit, processes a rank id per iteration
-- The number of rows processed with each iteration is equal to the number of groups in the data
-- More groups --> greater efficiency
WHILE   (1 = 1)
BEGIN
        SET @Rank = @Rank + 1;

        -- Set-based update with running totals for the current rank id
        UPDATE  This
        SET     InterestAmount = (Previous.RunningTotal + This.Amount) * This.MonthRate,
                RunningTotal = Previous.RunningTotal + This.Amount + (Previous.RunningTotal + This.Amount) * This.MonthRate
        FROM    #Work This
        JOIN    #Work Previous
                ON  Previous.Acct_No = This.Acct_No
                AND Previous.RowRank = @Rank - 1
        WHERE   This.RowRank = @Rank;

        IF  (@@ROWCOUNT = 0) BREAK;
END;

-- Show the results in natural order
SELECT  *
FROM    #Work
ORDER   BY
        Acct_No, RowRank;

In SQL Server 2012, you could use the windowing function extensions e.g. SUM OVER (ORDER BY).

Sql-server – Auto Update Statistics Duration and Interruptions

You can grab this data with the Auto Stats Event Class with SQL Trace. One of the event's data columns is Duration. As per the referenced BOL documentation above, here is a quote:

Duration: Amount of time (in microseconds) taken by the event.

That should give you a pretty good idea when/how long auto stats is taking. If you are seeing a lot of performance-halting auto update stats (synchronously), you could consider utilizing auto update statistics async. Turning this on will tell the query optimizer to not wait for statistics to be updated in order to continue compiling the query. This is one solution in order to prevent the direct interference of stats updates, but of course you're sacrificing fresh statistics for that particular query compilation.

For more information on this, please see the MSDN reference on Statistics.

Best Answer

Related Solutions

Sql-server – SQL Server – Multiple running totals

Sql-server – Auto Update Statistics Duration and Interruptions

Related Question