Sql-server – Troubleshooting SOS_SCHEDULER_YIELD wait

performancesql serversql-server-2012wait-types

Running our corporate ERP (Dynamics AX 2012), I noticed our production environment seemed much slower than our development systems.

After performing the same activities in both the development and production environments while running a trace, I confirmed that SQL queries were executing very slowly on our
production environment compared to development (10-50x slower on average).

At first I attributed this to load, and re-ran the same activities on the production environment during off hours and found the same results in the trace.

I cleared my wait stats in SQL Server then let the server run under its normal production load for a little while, and then ran this query:

WITH [Waits] AS
    (SELECT
        [wait_type],
        [wait_time_ms] / 1000.0 AS [WaitS],
        ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
        [signal_wait_time_ms] / 1000.0 AS [SignalS],
        [waiting_tasks_count] AS [WaitCount],
        100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
        ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
    FROM sys.dm_os_wait_stats
    WHERE [wait_type] NOT IN (
        N'CLR_SEMAPHORE',    N'LAZYWRITER_SLEEP',
        N'RESOURCE_QUEUE',   N'SQLTRACE_BUFFER_FLUSH',
        N'SLEEP_TASK',       N'SLEEP_SYSTEMTASK',
        N'WAITFOR',          N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
        N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH',
        N'XE_TIMER_EVENT',   N'XE_DISPATCHER_JOIN',
        N'LOGMGR_QUEUE',     N'FT_IFTS_SCHEDULER_IDLE_WAIT',
        N'BROKER_TASK_STOP', N'CLR_MANUAL_EVENT',
        N'CLR_AUTO_EVENT',   N'DISPATCHER_QUEUE_SEMAPHORE',
        N'TRACEWRITE',       N'XE_DISPATCHER_WAIT',
        N'BROKER_TO_FLUSH',  N'BROKER_EVENTHANDLER',
        N'FT_IFTSHC_MUTEX',  N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        N'DIRTY_PAGE_POLL',  N'SP_SERVER_DIAGNOSTICS_SLEEP')
    )
SELECT
    [W1].[wait_type] AS [WaitType],
    CAST ([W1].[WaitS] AS DECIMAL(14, 2)) AS [Wait_S],
    CAST ([W1].[ResourceS] AS DECIMAL(14, 2)) AS [Resource_S],
    CAST ([W1].[SignalS] AS DECIMAL(14, 2)) AS [Signal_S],
    [W1].[WaitCount] AS [WaitCount],
    CAST ([W1].[Percentage] AS DECIMAL(4, 2)) AS [Percentage],
    CAST (([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgWait_S],
    CAST (([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgRes_S],
    CAST (([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgSig_S]
FROM [Waits] AS [W1] INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum], [W1].[wait_type], [W1].[WaitS],
    [W1].[ResourceS], [W1].[SignalS], [W1].[WaitCount], [W1].[Percentage]
HAVING SUM ([W2].[Percentage]) - [W1].[Percentage] < 95; -- percentage threshold

My results are as follows:

WaitType               Wait_S  Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
SOS_SCHEDULER_YIELD   4162.52        3.64   4158.88    4450085       77.33     0.0009    0.0000    0.0009
ASYNC_NETWORK_IO       457.98      331.59    126.39     351113        8.51     0.0013    0.0009    0.0004
PAGELATCH_EX           252.94        5.14    247.80     796348        4.70     0.0003    0.0000    0.0003
WRITELOG               166.01       48.01    118.00     302209        3.08     0.0005    0.0002    0.0004
LCK_M_U                145.47      145.45      0.02        123        2.70     1.1827    1.1825    0.0002

So seemingly the largest Wait is SOS_Scheduler_Yield by far, and I googled around and found it typically relates to the CPU not being able to keep up.

I then ran this query multiple times in succession.

SELECT *
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255

I know I'm supposed to be looking for schedulers with non-zero runnable_tasks_count or pending_disk_io_count, but it's basically zero almost all the time.

I should also mention that Max Degree of Parallelism was set to 1, since the Dynamics AX workload is typically OLTP in nature, and changing it 8 did not make much of difference in the above wait stats, they became almost the exact same with the same performance problems.

I'm sort of at a loss of where to go from here, I basically have a SQL Server that is seemingly CPU strapped but not waiting on runnable_tasks or IO.

I do know that the IO subsystem of this SQL Server isn't very good, because running SQLIO on the drive containing the actual databases can lead to pretty low numbers (think 10MB a sec for certain types of reads/write), that said, it doesn't seem like SQL is waiting on that because of the amount of memory on the server caching most of the databases.

Here is some environment information to help:

Production environment:

  • SQL Server
  • HP ProLian DL360p Gen8
  • Intel Xeon E5-2650 0 @ 2.00GHz x 2 with hyperthreading (32 logical cores)
  • 184GB memory
  • Windows Server 2012
  • 2 instances of SQL Server 2012 Standard (RTM, unpatched)
  • Raid 1 279GB drives (15k) C: drive, contains databases and operating system
  • Page File and TempDB on distinct, separate drives (solid state)

My DEV:

  • Hyper-V hosted SQL Server and Dynamics AX 2012 AOS server
  • Core i7 3.4ghz with hyperthreading (8 logical cores)
  • 8GB of memory
  • Windows Server 2008 R2
  • SSD for the entire VM.

I would welcome any input on other things to look for.

Best Answer

So I resolved this, turns out that power management features were enabled on our SQL server that were scaling the CPU frequency up and down, but not fast enough to keep up with the small demand and introduced the SOS_Scheduler_Yield wait. After changing it to run always in high performance the issue went away and now the waits are more normal (LatchIO type stuff).