I believe you will see this symptom if you have a LOT of large query plans fighting for memory in order to compile (this has very little to do with running the queries themselves). To hit this, I suspect you are using an ORM or some kind of application that generates many unique but relatively complex queries. SQL Server could be under memory pressure because of things like large query operations, but on further thought it is more likely that your system is simply configured with far less memory than it needs: either there is never enough memory to satisfy all of the queries you're trying to compile, or other processes on the box are stealing memory from SQL Server.
You can take a look at what SQL Server is configured with using:
EXEC sp_configure 'max server memory'; -- max configured in MB
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN
(
'Total Server Memory (KB)', -- max currently granted
'Target Server Memory (KB)' -- how much SQL Server wished it had
);
You can identify the cached plans that required the most compile memory with the following query from Jonathan Kehayias, adapted slightly:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (10) CompileTime_ms, CompileCPU_ms, CompileMemory_KB,
qs.execution_count,
qs.total_elapsed_time/1000.0 AS duration_ms,
qs.total_worker_time/1000.0 as cputime_ms,
(qs.total_elapsed_time/qs.execution_count)/1000.0 AS avg_duration_ms,
(qs.total_worker_time/qs.execution_count)/1000.0 AS avg_cputime_ms,
qs.max_elapsed_time/1000.0 AS max_duration_ms,
qs.max_worker_time/1000.0 AS max_cputime_ms,
SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
(CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text) ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2 + 1) AS StmtText,
query_hash, query_plan_hash
FROM
(
SELECT
c.value('xs:hexBinary(substring((@QueryHash)[1],3))', 'varbinary(max)') AS QueryHash,
c.value('xs:hexBinary(substring((@QueryPlanHash)[1],3))', 'varbinary(max)') AS QueryPlanHash,
c.value('(QueryPlan/@CompileTime)[1]', 'int') AS CompileTime_ms,
c.value('(QueryPlan/@CompileCPU)[1]', 'int') AS CompileCPU_ms,
c.value('(QueryPlan/@CompileMemory)[1]', 'int') AS CompileMemory_KB,
qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY qp.query_plan.nodes('ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS n(c)
) AS tab
JOIN sys.dm_exec_query_stats AS qs ON tab.QueryHash = qs.query_hash
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY CompileMemory_KB DESC
OPTION (RECOMPILE, MAXDOP 1);
You can see how the plan cache is being used with the following:
SELECT objtype, cacheobjtype,
    AVG(size_in_bytes*1.0)/1024.0/1024.0 AS avg_mb,
    MAX(size_in_bytes)/1024.0/1024.0 AS max_mb,
    SUM(size_in_bytes)/1024.0/1024.0 AS total_mb,
    COUNT(*) AS plan_count
FROM sys.dm_exec_cached_plans
GROUP BY GROUPING SETS ((),(objtype, cacheobjtype))
ORDER BY objtype, cacheobjtype;
When you are experiencing high semaphore waits, check whether these query results vary significantly from what you see during "normal" activity:
SELECT resource_semaphore_id, -- 0 = regular, 1 = "small query"
pool_id,
available_memory_kb,
total_memory_kb,
target_memory_kb
FROM sys.dm_exec_query_resource_semaphores;
SELECT StmtText = SUBSTRING(st.[text], (qs.statement_start_offset / 2) + 1,
(CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text) ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2 + 1),
r.start_time, r.[status], DB_NAME(r.database_id), r.wait_type,
r.last_wait_type, r.total_elapsed_time, r.granted_query_memory,
m.requested_memory_kb, m.granted_memory_kb, m.required_memory_kb,
m.used_memory_kb
FROM sys.dm_exec_requests AS r
INNER JOIN sys.dm_exec_query_stats AS qs
ON r.plan_handle = qs.plan_handle
INNER JOIN sys.dm_exec_query_memory_grants AS m
ON r.request_id = m.request_id
AND r.plan_handle = m.plan_handle
CROSS APPLY sys.dm_exec_sql_text(r.plan_handle) AS st;
And you may also want to look and see how memory is distributed:
DBCC MEMORYSTATUS;
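DBCC MEMORYSTATUS produces a lot of output; as a lighter-weight sketch of the same idea, you can group the memory clerks DMV by clerk type (the pages_kb column assumes SQL Server 2012 or later; older versions expose single_pages_kb / multi_pages_kb instead):

```sql
-- Memory distribution by clerk type (pages_kb exists in SQL Server 2012+)
SELECT [type], SUM(pages_kb) / 1024.0 AS memory_mb
FROM sys.dm_os_memory_clerks
GROUP BY [type]
ORDER BY memory_mb DESC;
```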
And there is some good information here about why you might be seeing a high number of compiles/recompiles (which will contribute to that wait):
http://technet.microsoft.com/en-us/library/ee343986(v=sql.100).aspx
http://technet.microsoft.com/en-us/library/cc293620.aspx
You can check for high compile/recompile counts using the following counters:
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN
(
'SQL Compilations/sec',
'SQL Re-Compilations/sec'
);
And you can check for internal memory pressure leading to plan cache evictions; non-zero values in the rounds_count and removed_all_rounds_count columns here would indicate that the clock hands have been actively sweeping plans out of the cache:
SELECT * FROM sys.dm_os_memory_cache_clock_hands
WHERE [type] IN (N'CACHESTORE_SQLCP', N'CACHESTORE_OBJCP');
NOTE: Most of these metrics don't have a magic "oh my gosh, I need to panic or do something!" threshold. What you need to do is take measurements during normal system activity and determine where those thresholds lie for your hardware, configuration, and workload. The time to do something is when two conditions are true:
- the metrics vary significantly from normal values; and,
- there is actually a performance problem occurring (like your CPU spikes), and it is actually interfering with something. Other than seeing the CPUs spike, are you seeing any other symptom? In other words, is the spike the symptom, or is the spike causing other symptoms? Would users of the system ever notice? A lot of people always go after their highest wait consumer simply because it's the highest. Something is always going to be the highest wait consumer; you have to know that it varies enough from normal activity to indicate a problem or some significant change.
Optimize for ad hoc workloads
is a great setting for 99% of the workloads out there, but it will not help much in reducing compilation costs: it is aimed at reducing plan cache bloat by storing only a small stub for a single-use plan until the query has been executed a second time. Even when only the stub is stored in the plan cache, you still have to compile the full plan in order to execute the query. Perhaps what @Kahn meant to recommend was setting database-level parameterization to forced, which can provide better plan re-use (though it really depends on how unique all of these high-cost queries are).
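For reference, here is a sketch of how those two options are enabled (the database name is a placeholder; test forced parameterization carefully, since it changes plan behavior for every query in the database):

```sql
-- Instance-wide: cache only a plan stub until a query is executed a second time
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;

-- Database-level forced parameterization (YourDatabase is a placeholder)
ALTER DATABASE YourDatabase SET PARAMETERIZATION FORCED;
```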
There is also some good information in this white paper about plan caching and compilation.
Best Answer
As mentioned, the count being only 9 (i.e., the wait showed up only 9 times during a full day of gathering wait stats) would not be a concern for me.
However, it's worth checking the related wait types BACKUPIO and BACKUPBUFFER. These wait types generally show up when you are taking backups over a network with poor connectivity, to tape, or to any other extremely slow backup destination.
We saw a similar wait on one of our SQL Server 2005 instances when backing up a database approximately 200 GB in size. Because the database was large and native backup compression was not available at the time, we witnessed a lot of these waits. We had to go with a third-party compression tool, after which those wait counts dropped. So I believe checking on your backups would be a good start. Make sure you use SQL Server's backup compression feature when backing up your databases.
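As a sketch, native backup compression (introduced in SQL Server 2008) is just an extra option on the backup command; the database name and path below are placeholders:

```sql
-- Compressed full backup (database name and path are placeholders)
BACKUP DATABASE YourDatabase
TO DISK = N'X:\Backups\YourDatabase.bak'
WITH COMPRESSION, STATS = 10;
```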
However, just to confirm whether this wait type is actually causing problems, you need to check your backup throughput for issues.
You could also test native backups on the same network versus another network, and compare the relevant counter values.
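One way to estimate backup throughput, as a sketch, is from the backup history in msdb:

```sql
-- Approximate backup throughput (MB/sec) from backup history in msdb
SELECT TOP (20)
    database_name,
    backup_start_date,
    backup_size / 1024.0 / 1024.0 AS backup_size_mb,
    DATEDIFF(SECOND, backup_start_date, backup_finish_date) AS duration_sec,
    (backup_size / 1024.0 / 1024.0)
      / NULLIF(DATEDIFF(SECOND, backup_start_date, backup_finish_date), 0) AS mb_per_sec
FROM msdb.dbo.backupset
ORDER BY backup_start_date DESC;
```

Comparing these numbers for backups taken on different networks or to different destinations should show whether the backup target is the bottleneck.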
Also, you can refer to A cause of high-duration ASYNC_IO_COMPLETION waits for more explanation from Paul on this wait type.