SQL Server – ~2 hours after a large insert SQL Server gets "I/O requests taking longer than 15 seconds to complete"

sql-server, sql-server-2012

This problem has been haunting me for a while. I have actually fixed the queries that caused it (after I finally found them), but I'm curious whether the underlying problem is a SQL Server process, whether our RAID controllers are shit, or what the problem truly is. So here's the scenario that's been bothering us.

At ~11:10 and ~01:10 the servers would give these I/O errors; even Remote Desktop sessions would fail, Management Studio would give timeouts, and everything would point towards imminent failure of the cluster.

(The servers are identical; we run Windows Server 2012 Standard, clustered in an AlwaysOn availability group, running SQL Server 2012 Enterprise Edition.)

Around 5-15 minutes later, the servers would come back to life, with nothing in the event logs except the "well known" Event ID 833: SQL Server has encountered 16870 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\XXXXXXXX.mdf] in database [XXXXX] (10). The OS file handle is 0x0000000000001284. The offset of the latest long I/O is: 0x0000041fb3c000.
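For what it's worth, the exact timestamps of these warnings can be pulled out of the SQL Server error log with the undocumented but widely used xp_readerrorlog, something like:

-- Search the current SQL Server error log (0 = current log, 1 = SQL Server log)
-- for the long-I/O warnings, to see exactly when the stalls start and stop.
EXEC master.dbo.xp_readerrorlog 0, 1, N'taking longer than 15 seconds';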

Now I was running Profiler traces to find out which queries were running, doing anything I could to figure out what would cause the servers to just freeze up and refuse to serve requests properly. The event log points towards a specific database, so I searched in that direction, and yesterday I found a table that looked odd to me.

The specific table was 11 GB in size, had 23 GB worth of indexes (I haven't looked into those yet, but that seemed wrong to me) and held 80 million records. Considering the table should hold around 2 million and be growing relatively slowly, this sparked my interest, and I found three stored procedures which fill this table.
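Roughly, a query like the following shows the row count and reserved space per index; dbo.YourLargeTable below is a stand-in for the real table name:

-- Row count and reserved space per index for one table,
-- to spot a table/indexes that have grown far beyond what the application expects.
SELECT  i.name                                   AS index_name,
        SUM(ps.row_count)                        AS row_count,
        SUM(ps.reserved_page_count) * 8 / 1024.0 AS reserved_mb
FROM sys.dm_db_partition_stats AS ps
INNER JOIN sys.indexes AS i
    ON i.[object_id] = ps.[object_id]
   AND i.index_id = ps.index_id
WHERE ps.[object_id] = OBJECT_ID(N'dbo.YourLargeTable')
GROUP BY i.name;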

All three stored procedures had flaws which made them insert not a few hundred records but over 500,000 each time they ran. (These are now fixed so the issue doesn't present itself, but I feel there is an underlying problem here.)

These stored procedures would run at 09:00 and 23:00 (exactly 2 hours before the problems occurred). The procedures themselves would take around 5 minutes to run to completion.

Now I'm wondering why it would take the servers two hours to start dying on me when the stored procedures in question had already finished. I don't run any index maintenance at that point, or anything else for that matter (it's the first thing I checked). As far as I know, nothing runs at those specific times. Could it be that SQL Server does some 'self maintenance' and notices that some indexes need to be defragmented, or something similar, even though I haven't asked for any such thing?

Could it be that our RAID controllers (or even the SSDs connected to them) are flawed, and after a LOT of activity they start showing signs of degradation two hours later?

I simply cannot explain this behaviour, and I'm worried our production servers might be unreliable (which would suck since they're around 4 months old). I fixed the issue last evening, and for 24 hours we haven't had this problem (I wouldn't call it 'resolved' unless it's been stable for multiple days), but I'm still very curious what the hell my servers are doing.

Best Answer

BOL describes event ID 833 as follows:

This problem can be caused by system performance issues, hardware errors, firmware errors, device driver problems, or filter driver intervention in the IO process.

In practice this means one of two things:

  • SQL Server is issuing more I/O than the disk subsystem can handle.
  • Some rogue process on the system is saturating the disks with I/O requests.

Below is the approach I would follow:

Troubleshoot using Wait Statistics (DMVs)

-- Aggregate wait statistics since the instance last started (or since the wait
-- stats were last cleared), filtered to exclude benign background waits and
-- ranked by total wait time.
WITH [Waits] AS
    (SELECT
        [wait_type],
        [wait_time_ms] / 1000.0 AS [WaitS],
        ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
        [signal_wait_time_ms] / 1000.0 AS [SignalS],
        [waiting_tasks_count] AS [WaitCount],
        100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
        ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
    FROM sys.dm_os_wait_stats
    WHERE [wait_type] NOT IN (
        N'CLR_SEMAPHORE',    N'LAZYWRITER_SLEEP',
        N'RESOURCE_QUEUE',   N'SQLTRACE_BUFFER_FLUSH',
        N'SLEEP_TASK',       N'SLEEP_SYSTEMTASK',
        N'WAITFOR',          N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
        N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH',
        N'XE_TIMER_EVENT',   N'XE_DISPATCHER_JOIN',
        N'LOGMGR_QUEUE',     N'FT_IFTS_SCHEDULER_IDLE_WAIT',
        N'BROKER_TASK_STOP', N'CLR_MANUAL_EVENT',
        N'CLR_AUTO_EVENT',   N'DISPATCHER_QUEUE_SEMAPHORE',
        N'TRACEWRITE',       N'XE_DISPATCHER_WAIT',
        N'BROKER_TO_FLUSH',  N'BROKER_EVENTHANDLER',
        N'FT_IFTSHC_MUTEX',  N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        N'DIRTY_PAGE_POLL',  N'SP_SERVER_DIAGNOSTICS_SLEEP')
    )
SELECT
    [W1].[wait_type] AS [WaitType],
    CAST ([W1].[WaitS] AS DECIMAL(14, 2)) AS [Wait_S],
    CAST ([W1].[ResourceS] AS DECIMAL(14, 2)) AS [Resource_S],
    CAST ([W1].[SignalS] AS DECIMAL(14, 2)) AS [Signal_S],
    [W1].[WaitCount] AS [WaitCount],
    CAST ([W1].[Percentage] AS DECIMAL(5, 2)) AS [Percentage],
    CAST (([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgWait_S],
    CAST (([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgRes_S],
    CAST (([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgSig_S]
FROM [Waits] AS [W1]
INNER JOIN [Waits] AS [W2]
    ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum], [W1].[wait_type], [W1].[WaitS],
    [W1].[ResourceS], [W1].[SignalS], [W1].[WaitCount], [W1].[Percentage]
HAVING SUM ([W2].[Percentage]) - [W1].[Percentage] < 95; -- percentage threshold
GO

Look at the pending I/Os

-- Files with I/O requests currently pending at the OS level,
-- and how long those requests have been pending.
SELECT t1.database_id,
    t1.[file_id],
    t1.io_stall,
    t2.io_pending_ms_ticks,
    t2.scheduler_address
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS t1
INNER JOIN sys.dm_io_pending_io_requests AS t2
    ON t1.file_handle = t2.io_handle;
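The same DMV also gives cumulative latency per file since the last restart, which shows which files are suffering the stalls. A minimal sketch:

-- Average read/write latency per database file since the instance last started.
SELECT  DB_NAME(vfs.database_id) AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0)   AS avg_read_ms,
        vfs.num_of_writes,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.[file_id] = vfs.[file_id]
ORDER BY avg_write_ms DESC;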

Below are PERFMON counters that will help you:

  • Avg. Disk sec/Transfer => average time taken to complete an I/O operation (latency).
  • Disk Bytes/sec => total bytes read from and written to disk per second.
  • Process: IO Data Bytes/sec => bytes read from and written to disk per second by each process.
  • Buffer Manager: Page reads/sec + Page writes/sec => pages read from and written to disk per second by the SQL Server process.
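The SQL Server side of those counters is also exposed through sys.dm_os_performance_counters; per-second counters are stored there as cumulative values, so a rough sketch is to take two snapshots and diff them:

-- Two snapshots of the cumulative Buffer Manager page read/write counters,
-- 10 seconds apart, turned into a per-second rate.
DECLARE @snap TABLE (counter_name nvarchar(128), cntr_value bigint);

INSERT INTO @snap
SELECT RTRIM(counter_name), cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name IN (N'Page reads/sec', N'Page writes/sec');

WAITFOR DELAY '00:00:10';

SELECT RTRIM(pc.counter_name) AS counter_name,
       (pc.cntr_value - s.cntr_value) / 10.0 AS per_second
FROM sys.dm_os_performance_counters AS pc
INNER JOIN @snap AS s
    ON RTRIM(pc.counter_name) = s.counter_name
WHERE pc.object_name LIKE '%Buffer Manager%';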

Such errors can also occur when a power-saving Power Management scheme is selected for the server, so check your Windows power plan setting and make sure you are using the High Performance plan.
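If xp_cmdshell is enabled (or from an elevated command prompt on the server), the active plan can be checked with powercfg; a minimal sketch:

-- Shows the currently active Windows power scheme; it should report
-- "High performance" on a dedicated SQL Server box.
EXEC master.dbo.xp_cmdshell 'powercfg /getactivescheme';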

All three stored procedures had flaws which made them insert not a few hundred records but over 500,000 each time they ran. (These are now fixed so the issue doesn't present itself, but I feel there is an underlying problem here.)

If you are inserting a huge amount of data, then you should refer to The Data Loading Performance Guide to prepare your server for such operations, e.g. by enabling trace flag 610.
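A minimal sketch of enabling that trace flag instance-wide (whether the inserts are then actually minimally logged still depends on the recovery model and how the insert is written, as the guide explains):

-- Enable trace flag 610 globally and confirm it is active
-- (applies to SQL Server 2008 - 2014; later versions changed this behaviour).
DBCC TRACEON (610, -1);
DBCC TRACESTATUS (610);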

Note: Others have mentioned that Auto Update Stats might be causing it, which is also worth looking into, but I would primarily concentrate on finding the rogue process that is generating lots of disk I/O, and on checking the storage subsystem to make sure all the drivers, firmware, etc. are up to date and configured correctly.
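To check the auto-stats theory, STATS_DATE shows when each statistics object on the suspect table was last updated, which can be lined up against the 11:10/01:10 incident windows (dbo.YourLargeTable is a placeholder):

-- Last update time of every statistics object on the table.
SELECT  s.name AS stats_name,
        STATS_DATE(s.[object_id], s.stats_id) AS last_updated,
        s.auto_created,
        s.no_recompute
FROM sys.stats AS s
WHERE s.[object_id] = OBJECT_ID(N'dbo.YourLargeTable');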

Good reading links: