BOL (Books Online) describes event ID 833 as follows:
"This problem can be caused by system performance issues, hardware errors, firmware errors, device driver problems, or filter driver intervention in the I/O process."
What it means is that either:
- SQL Server is doing more I/Os than the disk subsystem can handle, or
- there is some rogue process running on the system that is saturating the disks with I/O requests.
Below is the approach I would follow:
Troubleshoot using wait statistics (DMVs)
WITH [Waits] AS
(SELECT
[wait_type],
[wait_time_ms] / 1000.0 AS [WaitS],
([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
[signal_wait_time_ms] / 1000.0 AS [SignalS],
[waiting_tasks_count] AS [WaitCount],
100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
FROM sys.dm_os_wait_stats
WHERE [wait_type] NOT IN (
N'CLR_SEMAPHORE', N'LAZYWRITER_SLEEP',
N'RESOURCE_QUEUE', N'SQLTRACE_BUFFER_FLUSH',
N'SLEEP_TASK', N'SLEEP_SYSTEMTASK',
N'WAITFOR', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH',
N'XE_TIMER_EVENT', N'XE_DISPATCHER_JOIN',
N'LOGMGR_QUEUE', N'FT_IFTS_SCHEDULER_IDLE_WAIT',
N'BROKER_TASK_STOP', N'CLR_MANUAL_EVENT',
N'CLR_AUTO_EVENT', N'DISPATCHER_QUEUE_SEMAPHORE',
N'TRACEWRITE', N'XE_DISPATCHER_WAIT',
N'BROKER_TO_FLUSH', N'BROKER_EVENTHANDLER',
N'FT_IFTSHC_MUTEX', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
N'DIRTY_PAGE_POLL', N'SP_SERVER_DIAGNOSTICS_SLEEP')
)
SELECT
[W1].[wait_type] AS [WaitType],
CAST ([W1].[WaitS] AS DECIMAL(14, 2)) AS [Wait_S],
CAST ([W1].[ResourceS] AS DECIMAL(14, 2)) AS [Resource_S],
CAST ([W1].[SignalS] AS DECIMAL(14, 2)) AS [Signal_S],
[W1].[WaitCount] AS [WaitCount],
CAST ([W1].[Percentage] AS DECIMAL(4, 2)) AS [Percentage],
CAST (([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgWait_S],
CAST (([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgRes_S],
CAST (([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgSig_S]
FROM [Waits] AS [W1]
INNER JOIN [Waits] AS [W2]
ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum], [W1].[wait_type], [W1].[WaitS],
[W1].[ResourceS], [W1].[SignalS], [W1].[WaitCount], [W1].[Percentage]
HAVING SUM ([W2].[Percentage]) - [W1].[Percentage] < 95; -- percentage threshold
GO
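In the output, pay particular attention to the I/O-related wait types: PAGEIOLATCH_SH and PAGEIOLATCH_EX (waiting for data pages to be read from disk), WRITELOG (waiting for log flushes), and IO_COMPLETION / ASYNC_IO_COMPLETION. If those dominate the list with high average wait times, the disk subsystem is the bottleneck rather than CPU or memory.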
Look at the pending I/Os:
SELECT t1.database_id
    ,t1.file_id
    ,t1.io_stall
    ,t2.io_pending_ms_ticks
    ,t2.scheduler_address
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS t1
INNER JOIN sys.dm_io_pending_io_requests AS t2
    ON t1.file_handle = t2.io_handle
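Even when nothing is pending at the moment you look, the cumulative stall figures show which files suffer most. A minimal sketch of per-file average latency (all columns come from sys.dm_io_virtual_file_stats; NULLIF just avoids a divide-by-zero on idle files):
SELECT DB_NAME(vfs.database_id) AS [database_name]
    ,vfs.file_id
    ,vfs.num_of_reads
    ,vfs.num_of_writes
    -- average stall per I/O operation, in milliseconds
    ,vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS [avg_read_stall_ms]
    ,vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS [avg_write_stall_ms]
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY [avg_read_stall_ms] DESC;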
Below are PERFMON counters that will help you:
- Avg. Disk sec/Transfer => time taken to perform an I/O operation
- Disk Bytes/sec => total bytes read from and written to disk per second
- Process: IO Data Bytes/sec => total bytes read and written per second by each process
- SQL Server: Buffer Manager: Page reads/sec + Page writes/sec => total pages read from and written to disk per second by the SQL Server process (note these counters are in pages, not bytes)
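If you would rather sample the SQL Server counters from T-SQL than from PERFMON, the Buffer Manager counters are also exposed through sys.dm_os_performance_counters. Bear in mind the values are cumulative, so you need two samples a known interval apart to derive a per-second rate. A minimal sketch:
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE N'%Buffer Manager%'
    AND (counter_name LIKE N'Page reads/sec%'
        OR counter_name LIKE N'Page writes/sec%');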
Also, such errors can sometimes occur when a different power management scheme is selected for the server, so check your Windows power plan setting and make sure you are using the High Performance power plan.
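If you want to confirm the active plan from within SQL Server, powercfg reports it (shown here via xp_cmdshell purely for convenience; this assumes xp_cmdshell is enabled, which many environments deliberately keep off):
EXEC master.dbo.xp_cmdshell 'powercfg /getactivescheme';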
You mentioned: "All three stored procedures had flaws which made them insert not a few hundred records but over 500,000 each time they ran. (These are now fixed, so the issue no longer presents itself, but I feel there is an underlying problem here.)"
If you are inserting a huge amount of data, then refer to The Data Loading Performance Guide to tune your server for such operations, e.g. by enabling trace flag 610.
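For illustration, a minimal sketch of a minimally logged load using trace flag 610 (SQL Server 2008 and later; the table and column names are hypothetical, and the exact conditions under which inserts qualify for minimal logging are spelled out in the guide, so test this against your recovery-model requirements first):
-- enable minimally logged inserts into indexed tables, instance-wide
DBCC TRACEON (610, -1);

-- hypothetical load; TABLOCK helps the insert qualify for bulk optimizations
INSERT INTO dbo.TargetTable WITH (TABLOCK) (Col1, Col2)
SELECT Col1, Col2
FROM dbo.StagingTable;

DBCC TRACEOFF (610, -1);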
Note: others have mentioned that auto update statistics might be causing it, which is also worth looking at, but I would primarily concentrate on finding the rogue process that is causing lots of disk I/O, and on checking the storage subsystem to see that all the drivers, firmware, etc. are up to date and configured correctly.
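To quickly rule the statistics angle in or out, check whether auto update statistics is running synchronously (a synchronous update blocks the query that triggered it; the columns below come from sys.databases):
SELECT name
    ,is_auto_update_stats_on
    ,is_auto_update_stats_async_on
FROM sys.databases;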
You might be able to use Service Broker, though that is probably overkill, as you state.
Alternately, if you don't want to install and manage an extra service for this one need, you could use xp_cmdshell or a CLR-based trigger to make an external call when needed, to a program that starts the desired process asynchronously (outside the transaction that called the trigger) and returns immediately. I would be more wary of this than of simply putting some complex code in a trigger, though: at least with a trigger, any good SQL Server DBA would be able to support the code when you are not around, rather than requiring some other specific programming experience.
When you say that polling simply isn't an option because an "up to 30 seconds" delay is unacceptable, is there a reason you can't poll more often? If the concern is the load the polling would impart on the server(s), there are ways to keep it very lightweight. No matter what, you will need the trigger (or something else) to decide whether action is needed at the point rows are inserted. If you make this decision in a trigger and simply record it in a small status table, you can then poll every second or so, waiting for that status marker to be updated. A database hit each second to perform a simple lookup like that will not impart significant load, unless you are using some cloud DB service that charges you by the query. Again, this is far from ideal, but it seems safer to me than the xp_cmdshell option.
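A minimal sketch of that pattern (all object names here are hypothetical, invented for illustration):
-- hypothetical status table the external poller watches
CREATE TABLE dbo.WorkPending (
    WorkType    varchar(50) NOT NULL PRIMARY KEY,
    IsPending   bit         NOT NULL DEFAULT (0),
    LastFlagged datetime2   NOT NULL DEFAULT (SYSUTCDATETIME())
);

INSERT INTO dbo.WorkPending (WorkType, IsPending) VALUES ('OrderExport', 0);
GO

-- the trigger only flips the flag; the heavy work stays outside the transaction
CREATE TRIGGER dbo.trg_Orders_FlagWork
ON dbo.Orders
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE dbo.WorkPending
    SET IsPending = 1, LastFlagged = SYSUTCDATETIME()
    WHERE WorkType = 'OrderExport' AND IsPending = 0;
END;
GO

-- what the poller runs every second or so: a single-row seek, very cheap
SELECT IsPending FROM dbo.WorkPending WHERE WorkType = 'OrderExport';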
Whatever you do, be careful to make sure failures in the process are flagged appropriately. Because you are pulling the extra processing out of the main transaction, you can't rely on your existing exception handling to tell you that it failed for some reason, so you might not know that you have a growing data consistency problem. The key advantage of the trigger here is that by staying in the same transaction you know your data is kept consistent, as one half won't be committed if the other half fails.
Best Answer
Big Output Buffers (BOBs) are used when a sort spills to tempdb.
The 'big' buffer optimizes for large sequential I/O instead of writing sort records individually.
A delay of this length indicates a problem with the disk subsystem.
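If you want to identify which queries are spilling sorts to tempdb, the sort_warning Extended Event fires on each spill (available in SQL Server 2012 and later; the session name and ring buffer target below are my own choices, not from the original answer):
CREATE EVENT SESSION SortSpills ON SERVER
ADD EVENT sqlserver.sort_warning (
    ACTION (sqlserver.sql_text, sqlserver.database_id))
ADD TARGET package0.ring_buffer;
GO
ALTER EVENT SESSION SortSpills ON SERVER STATE = START;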