SQL Server – Worker threads issue: How to read this crash dump, and what to do about it

Tags: crash, sql-server, sql-server-2016

Our production SQL Server froze up yesterday. We had to kill the Windows process to get it to restart.

Microsoft SQL Server 2016 (SP1-CU1) (KB3208177) - 13.0.4411.0 (X64)
Jan 6 2017 14:24:37  
Copyright (c) Microsoft Corporation
Enterprise Edition (64-bit) on Windows Server 2012 R2 Standard 6.3 <X64>
(Build 9600: ) (Hypervisor)

The event viewer says this at the time of the crash:

New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 300 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%%. System Idle: 99%%.
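For context on the "max worker threads" hint in that message, the configured worker ceiling can be compared against what the schedulers are actually doing with something along these lines (run it over the DAC if the server is already unresponsive; the DMV and column names below are the standard ones, nothing specific to our environment):

    -- Compare the worker-thread ceiling with actual worker usage per the
    -- scheduler DMVs. queued_tasks > 0 means tasks are waiting for a worker.
    SELECT
        (SELECT max_workers_count FROM sys.dm_os_sys_info) AS max_workers_count,
        SUM(current_workers_count)                         AS current_workers,
        SUM(active_workers_count)                          AS active_workers,
        SUM(work_queue_count)                              AS queued_tasks
    FROM sys.dm_os_schedulers
    WHERE status = N'VISIBLE ONLINE';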

I found this article (from 2010) that gives a few tips on how to read these crash dumps. I looked at the crash dump from our server and found thousands of blocks that all look like this:

 2376  Id: 6bc.84c8 Suspend: 0 Teb: 00007ff7`f74d4000 Unfrozen
 # Child-SP          RetAddr           Call Site
00 000000b5`5938bcd8 00007fff`d22e6d8e ntdll!NtSignalAndWaitForSingleObject+0xa
01 000000b5`5938bce0 00007fff`c1944b99 KERNELBASE!SignalObjectAndWait+0xc8
02 000000b5`5938bd90 00007fff`c19414dc sqldk!SOS_Scheduler::Switch+0x106
03 000000b5`5938c080 00007fff`c4c5185f sqldk!SOS_Scheduler::SuspendNonPreemptive+0xd3
04 000000b5`5938c0c0 00007fff`c4debd67 sqlmin!EventInternal<SuspendQueueSLock>::Wait+0x1e7
05 000000b5`5938c110 00007fff`c4debb61 sqlmin!LockOwner::Sleep+0x49a
06 000000b5`5938c210 00007fff`c4c57f1c sqlmin!lck_lockInternal+0xfd3
07 000000b5`5938cab0 00007fff`c4c5ef3a sqlmin!GetLock+0x1d9
08 000000b5`5938cb80 00007fff`c4ecc841 sqlmin!BTreeRow::AcquireLock+0x212
09 000000b5`5938cc90 00007fff`c4c5f20b sqlmin!IndexRowScanner::AcquireNextRowLock+0xf6
0a 000000b5`5938ccd0 00007fff`c4c774fa sqlmin!IndexDataSetSession::GetNextRowValuesInternal+0x12e6
0b 000000b5`5938cf60 00007fff`c4cb9dc2 sqlmin!RowsetNewSS::FetchNextRow+0x1d9
0c 000000b5`5938d080 00007fff`c56090b1 sqlmin!CQScanTableScanNew::GetRow+0xfa
0d 000000b5`5938d0f0 00007fff`c4e9592c sqlmin!CQScanSpoolNew::LoadSpool+0x51
0e 000000b5`5938d120 00007fff`c4c7b741 sqlmin!CQScanSpoolNew::Open+0xf3
0f 000000b5`5938d170 00007fff`c4cb1185 sqlmin!CQScanNew::OpenHelper+0x41
10 000000b5`5938d1a0 00007fff`c4c7b741 sqlmin!CQScanTopNew::Open+0x15
11 000000b5`5938d1d0 00007fff`c4c7b741 sqlmin!CQScanNew::OpenHelper+0x41
12 000000b5`5938d200 00007fff`c4db682d sqlmin!CQScanNew::OpenHelper+0x41
13 000000b5`5938d230 00007fff`c4c7b741 sqlmin!CQScanUpdateNew::Open+0x190
14 000000b5`5938d280 00007fff`c4db682d sqlmin!CQScanNew::OpenHelper+0x41
15 000000b5`5938d2b0 00007fff`c560909f sqlmin!CQScanUpdateNew::Open+0x190
16 000000b5`5938d300 00007fff`c4e9592c sqlmin!CQScanSpoolNew::LoadSpool+0x3f
17 000000b5`5938d330 00007fff`c560cc61 sqlmin!CQScanSpoolNew::Open+0xf3
18 000000b5`5938d380 00007fff`c4db6fe0 sqlmin!CQScanSequenceNew::Open+0xf1
19 000000b5`5938d410 00007fff`c26b0da9 sqlmin!CQueryScan::UncacheQuery+0x40b
1a 000000b5`5938d480 00007fff`c26b0e74 sqllang!CXStmtQuery::SetupQueryScanAndExpression+0x421
1b 000000b5`5938d500 00007fff`c26b6a67 sqllang!CXStmtQuery::InitForExecute+0x34
1c 000000b5`5938d530 00007fff`c26df2e9 sqllang!CXStmtQuery::ErsqExecuteQuery+0x4dc
1d 000000b5`5938d6b0 00007fff`c26df0ac sqllang!CXStmtDML::XretDMLExecute+0x3a3
1e 000000b5`5938d790 00007fff`c26b19ea sqllang!CXStmtDML::XretExecute+0xb0
1f 000000b5`5938d7c0 00007fff`c26b2973 sqllang!CMsqlExecContext::ExecuteStmts<1,1>+0x40d

According to the article linked above, it seems to match this pattern:

If it is a blocking issue and most of the threads are waiting to acquire a lock, you will find most of the stacks look similar to the one below. (We try to acquire a lock and go to wait, since someone else is holding the lock.)

ntdll!ZwSignalAndWaitForSingleObject
kernel32!SignalObjectAndWait
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::Suspend
sqlservr!SOS_Event::Wait
sqlservr!LockOwner::Sleep
sqlservr!lck_lockInternal
sqlservr!GetLock

OK, so now it seems we have a blocking issue. Now what? Where do we start looking for the root cause of this mess?
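My rough plan for next time, assuming a connection (for example over the DAC) is still possible, is to look for the head of the blocking chain with something like this sketch against the standard DMVs (untested during an actual hang):

    -- Who is blocked, by whom, and for how long. The session at the head of
    -- the chain (a blocking_session_id that never shows up as a blocked
    -- session_id) is the one to investigate.
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time    AS wait_time_ms,
           r.command,
           t.text         AS sql_text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id <> 0
    ORDER BY r.wait_time DESC;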

Here are the SQL logs from just before the crash:

03/28/2017 20:41:05,spid162,Unknown,The client was unable to reuse a session with SPID 162, which had been reset for connection pooling. The failure ID is 23. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
03/28/2017 20:41:05,spid162,Unknown,Error: 18056, Severity: 20, State: 23.
03/28/2017 20:41:05,Logon,Unknown,Login failed for user 'NT SERVICE\SQLTELEMETRY'. Reason: Access to server validation failed while revalidating the login on the connection. [CLIENT: <local machine>]
03/28/2017 20:41:05,Logon,Unknown,Error: 18456, Severity: 14, State: 23.
03/28/2017 20:39:45,spid19s,Unknown,Service Broker manager has shut down.
03/28/2017 20:39:44,spid5s,Unknown,SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.
03/28/2017 20:38:30,Server,Unknown,New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 900 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%. System Idle: 96%.
03/28/2017 20:33:29,Server,Unknown,New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 600 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%. System Idle: 95%.
03/28/2017 20:28:28,Server,Unknown,New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 300 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%. System Idle: 99%.
03/28/2017 20:28:28,Server,Unknown,Timeout waiting for external dump process 13824.
03/28/2017 20:25:59,Server,Unknown,Stack Signature for the dump is 0x0000000000000097
03/28/2017 20:25:59,Server,Unknown,* Short Stack Dump
03/28/2017 20:25:59,Server,Unknown,* -------------------------------------------------------------------------------
03/28/2017 20:25:59,Server,Unknown,* *******************************************************************************
03/28/2017 20:25:59,Server,Unknown,*
03/28/2017 20:25:59,Server,Unknown,* Deadlocked Schedulers
03/28/2017 20:25:59,Server,Unknown,*
03/28/2017 20:25:59,Server,Unknown,*   03/28/17 20:25:59 spid 1808
03/28/2017 20:25:59,Server,Unknown,* BEGIN STACK DUMP:
03/28/2017 20:25:59,Server,Unknown,*
03/28/2017 20:25:59,Server,Unknown,* *******************************************************************************
03/28/2017 20:25:59,Server,Unknown,***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\LOG\SQLDump0002.txt
03/28/2017 20:25:59,Server,Unknown,**Dump thread - spid = 0, EC = 0x0000000000000000

Best Answer

Community Wiki answer generated from comments on this and the previous question.

sp_BlitzErik: Threadpool waits can happen for reasons other than just long-running queries. The number of connections, the number of simultaneous queries (especially if they go parallel), and the number of background tasks can also contribute, and that's just within SQL Server.
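A quick way to see whether the thread pool has been starved before is to look at the accumulated THREADPOOL waits (the counter is cumulative since the last restart, so treat it as a rough signal only):

    -- THREADPOOL waits accumulate whenever a task had to wait for a worker thread.
    SELECT wait_type,
           waiting_tasks_count,
           wait_time_ms,
           max_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type = N'THREADPOOL';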

Enable the remote DAC and run sp_WhoIsActive the next time it happens. You may have to use @show_sleeping_spids to see connection pooling issues.
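Enabling the remote DAC is a plain sp_configure change; sp_WhoIsActive is Adam Machanic's procedure and has to be installed separately. A sketch follows, where per its documentation @show_sleeping_spids = 1 includes only sleeping sessions with an open transaction and 2 includes all sleeping sessions (the server name in the comment is just a placeholder):

    -- Turn on the remote DAC ahead of time, then connect with the ADMIN:
    -- prefix (e.g. sqlcmd -S ADMIN:YourServer) when the server stops
    -- accepting new work.
    EXEC sys.sp_configure N'remote admin connections', 1;
    RECONFIGURE;

    -- During the incident: include sleeping sessions so pooled connections
    -- that still hold locks show up.
    EXEC dbo.sp_WhoIsActive
         @show_sleeping_spids = 2,
         @get_locks = 1;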

Having too few CPUs for your workload, incorrect parallelism settings, and long blocking chains are common causes.
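The parallelism and worker-thread settings mentioned above can be reviewed in one pass, for instance:

    -- Instance-level settings that feed into worker-thread pressure.
    SELECT name, value, value_in_use
    FROM sys.configurations
    WHERE name IN (N'max degree of parallelism',
                   N'cost threshold for parallelism',
                   N'max worker threads');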

Shanky: For dump analysis you should contact MS support, as normal blocking does not cause stack dumps.


Deadlocked schedulers are always a SQL Server bug, which should be reported to Microsoft.