SQL Server Restarted with Pending Task Count Increased – Troubleshooting

performance, sql-server, sql-server-2012

We are trying to understand an issue where SQL Server suddenly restarted on its own.

It is SQL Server 2012 SP4 with the GDR applied, running on 40 CPUs with hyper-threading enabled, for a total of 80 logical processors.

MAXDOP = 8, cost threshold for parallelism = 5
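To confirm what the instance is actually running with, the parallelism and worker-thread settings can be read from sys.configurations. This is only a sketch: it assumes the three standard option names shown, and value_in_use is the column that matters if a change was made without RECONFIGURE.

```sql
-- Compare configured vs. in-use values for the settings mentioned above.
SELECT name,
       value,          -- configured value
       value_in_use    -- value the instance is currently running with
FROM sys.configurations
WHERE name IN ( N'max degree of parallelism',
                N'cost threshold for parallelism',
                N'max worker threads' )
ORDER BY name;
```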

From the error logs we found:

/**********************/

BEGIN stack Dump

Non-Yielding Scheduler

/**********************/

These error messages appeared right at the time SQL Server restarted. We checked, and no mini dump was created.

Yes, we had quite a few queries running at that time. The top 3 waits seen were:

1. TranLogIO
2. CXPACKET
3. PAGELATCH_SH

However, we also noticed a wait called SOS_WORKER in the data collected by the system_health Extended Events session, which I believe is nothing other than THREADPOOL. I therefore went further and analysed the query processing details from system_health, and found that the following was logged at the time of the non-yielding scheduler errors:
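For a live look at whether tasks are queuing for a worker, the waiting-tasks DMV can be filtered on these wait types. This is a sketch rather than a definitive check: it only shows tasks waiting at the moment it runs, and during real thread starvation it may only be possible to run it over the dedicated admin connection.

```sql
-- Tasks currently waiting for a worker thread, grouped by wait type.
SELECT wait_type,
       COUNT(*)              AS waiting_tasks,
       MAX(wait_duration_ms) AS longest_wait_ms
FROM sys.dm_os_waiting_tasks
WHERE wait_type IN ( N'THREADPOOL', N'SOS_WORKER' )
GROUP BY wait_type;
```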

At 19:46: max workers 2944, workers created 789, oldest pending task wait time 0, pending tasks 4

At 19:51: max workers 2944, workers created 982, oldest pending task wait time 256987, pending tasks 165

At 19:51 the restart began.
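The same query processing figures can be sampled on demand with sp_server_diagnostics, which is the source of the system_health component results. The sketch below runs it once and pulls a few attributes out of the query_processing row. The attribute names in the XPath expressions (maxWorkers, workersCreated, pendingTasks, oldestPendingTaskWaitingTime) are my assumption of how the XML is laid out and should be checked against the raw data column.

```sql
-- Temp table matching the procedure's documented result set.
CREATE TABLE #diag
(
    create_time    datetime,
    component_type sysname,
    component_name sysname,
    state          int,
    state_desc     sysname,
    data           nvarchar(max)
);

INSERT INTO #diag
EXEC sys.sp_server_diagnostics;   -- no argument: run once and return

SELECT d.create_time,
       d.state_desc,
       x.doc.value('(/queryProcessing/@maxWorkers)[1]', 'int')                      AS max_workers,
       x.doc.value('(/queryProcessing/@workersCreated)[1]', 'int')                  AS workers_created,
       x.doc.value('(/queryProcessing/@pendingTasks)[1]', 'int')                    AS pending_tasks,
       x.doc.value('(/queryProcessing/@oldestPendingTaskWaitingTime)[1]', 'bigint') AS oldest_pending_wait
FROM #diag AS d
CROSS APPLY ( SELECT CAST(d.data AS xml) AS doc ) AS x
WHERE d.component_name = N'query_processing';

DROP TABLE #diag;
```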

The question is: why would there be a THREADPOOL wait if, by the calculation above, almost 2000 workers were still available? And why was the pending task count 165 when there were so many schedulers available to run and complete the requests for the many queries waiting on CXPACKET?
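With the figures above, the headroom at 19:51 was 2944 - 982 = 1962 workers. A rough live equivalent can be computed from the scheduler DMVs, as sketched below. It assumes that summing current_workers_count and work_queue_count over the visible online schedulers approximates the "workers created" and "pending tasks" figures from system_health, and the two sources may not match exactly.

```sql
-- Instance-wide worker headroom and queued tasks, summed over user schedulers.
SELECT si.max_workers_count,
       SUM(s.current_workers_count)                        AS workers_created,
       si.max_workers_count - SUM(s.current_workers_count) AS worker_headroom,
       SUM(s.work_queue_count)                             AS pending_tasks
FROM sys.dm_os_schedulers AS s
CROSS JOIN sys.dm_os_sys_info AS si
WHERE s.status = N'VISIBLE ONLINE'
GROUP BY si.max_workers_count;
```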

Edit: Updating my question with another wait also seen in the XE data:

SOS_MEMORY_TOPLEVELBLOCKALLOCATOR

I have been reading about this wait here: https://www.sqlskills.com/help/waits/sos_memory_toplevelblockallocator/

According to that page there is a fix in an SP3 CU, but it requires trace flag 8075 (-T8075). I am currently on SP4 with the latest GDR patch. Do I still need to enable the trace flag, even though I do not see any messages like "Failed allocate pages: FAIL_PAGE_ALLOCATION 513" in the error log?
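Before deciding on the trace flag, it is easy to confirm the exact build and whether 8075 is already enabled globally. A minimal sketch using only built-in commands:

```sql
-- Exact build and service pack level of the instance.
SELECT SERVERPROPERTY('ProductVersion') AS product_version,
       SERVERPROPERTY('ProductLevel')   AS product_level;

-- Global status of trace flag 8075 (-1 = check at the global level).
DBCC TRACESTATUS (8075, -1);
```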

Not sure if it helps, but I see total server memory dropping by about a gigabyte now and then, and then climbing back up to target server memory, which equals max server memory (750 GB).

Most of the time total = target = max server memory. The drops are only about a gigabyte, not more.

Memory info: total RAM 880 GB, max server memory 750 GB, min server memory 130 GB.
It is a 2-node Windows cluster and no other SQL Server instance shares it. Resource Governor is not enabled.
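The total and target figures described above come from the Memory Manager performance counters, which can also be read from inside SQL Server. This is a sketch; the LIKE on object_name is there because the prefix differs between default and named instances.

```sql
-- Total vs. target server memory, converted from KB to GB.
SELECT RTRIM(counter_name)    AS counter_name,
       cntr_value / 1048576.0 AS value_gb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Memory Manager%'
  AND counter_name IN ( N'Total Server Memory (KB)',
                        N'Target Server Memory (KB)' );
```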

Thanks

Best Answer

I would start by saying that the best person to analyze a dump is someone at Microsoft, or someone else who knows how to read one. I will just try to point out some basics from the log you posted.

At 19:46: max workers 2944, workers created 789, oldest pending task wait time 0, pending tasks 4

At 19:51: max workers 2944, workers created 982, oldest pending task wait time 256987, pending tasks 165

Note the oldest pending task wait time of 256987 (roughly 257 seconds, if the value is in milliseconds) and the pending task count of 165. This means the scheduler was hung and 165 tasks were waiting to get onto a scheduler and run. SQL Server could not get out of this hung-scheduler state; it waited for a while and then decided that restarting itself was the best way out, hence the restart. Why the scheduler hung is beyond my ability to tell from the information you have posted.

Also note that a thread is assigned to a scheduler and has to run on that scheduler, which is why so many tasks were waiting on this hung scheduler.
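To see whether the backlog is concentrated on one scheduler, the per-scheduler counters can be listed as sketched below. Reading it involves some interpretation on my part: a scheduler with a growing work_queue_count or runnable_tasks_count whose yield_count does not move between two samples would be the one to suspect.

```sql
-- Per-scheduler load; run twice a few seconds apart and compare yield_count.
SELECT scheduler_id,
       cpu_id,
       current_tasks_count,
       runnable_tasks_count,
       current_workers_count,
       active_workers_count,
       work_queue_count,
       yield_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'
ORDER BY work_queue_count DESC, runnable_tasks_count DESC;
```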

Related Solutions

How to Collect Wait Stats in SQL Server 2012

SQL Server works in non-preemptive mode, which means that if SQLOS asks a task to yield, for example because of a request from the Windows OS, the task complies and yields, or does whatever else SQLOS asks. This is because SQL Server runs as an application whose resources are allocated by SQLOS, which in turn runs under the Windows OS. Preemptive wait types occur when SQL Server is executing a task and is interrupted by the OS and asked to give up the thread it is using so that the thread can be allocated to other tasks. SQL Server yields and waits until the thread is available again; this waiting is recorded under the PREEMPTIVE_XXX waits.

What version of SQL Server is this, and is it patched to the latest service pack? There was a bug in SQL Server 2008 that reported incorrect values for preemptive wait types.

Can you query the sys.dm_exec_requests DMV and see whether the requests showing preemptive wait types are suspended or running?
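A minimal sketch of that check, filtering on both the current and the last wait type so that requests which have just left a preemptive wait are also listed:

```sql
-- Requests currently in, or most recently in, a preemptive wait.
SELECT session_id,
       status,               -- running, runnable, suspended, ...
       command,
       wait_type,
       wait_time AS wait_time_ms,
       last_wait_type,
       cpu_time
FROM sys.dm_exec_requests
WHERE wait_type LIKE N'PREEMPTIVE%'
   OR last_wait_type LIKE N'PREEMPTIVE%'
ORDER BY wait_time DESC;
```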

Can you please post the output of the query below (by Jonathan Kehayias), which captures wait stats?

-- Top 10 waits by total wait time, excluding benign background waits.
SELECT TOP 10
        wait_type ,
        max_wait_time_ms ,
        wait_time_ms ,
        signal_wait_time_ms ,
        -- resource wait = total wait minus time spent waiting for CPU
        wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms ,
        100.0 * wait_time_ms / SUM(wait_time_ms) OVER ( )
                                    AS percent_total_waits ,
        100.0 * signal_wait_time_ms / SUM(signal_wait_time_ms) OVER ( )
                                    AS percent_total_signal_waits ,
        100.0 * ( wait_time_ms - signal_wait_time_ms )
        / SUM(wait_time_ms) OVER ( ) AS percent_total_resource_waits
FROM    sys.dm_os_wait_stats
WHERE   wait_time_ms > 0 -- remove zero wait_time
        AND wait_type NOT IN -- filter out additional irrelevant waits
( 'SLEEP_TASK', 'BROKER_TASK_STOP', 'BROKER_TO_FLUSH',
  'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT',
  'LAZYWRITER_SLEEP', 'SLEEP_SYSTEMTASK', 'SLEEP_BPOOL_FLUSH',
  'BROKER_EVENTHANDLER', 'XE_DISPATCHER_WAIT', 'FT_IFTSHC_MUTEX',
  'CHECKPOINT_QUEUE', 'FT_IFTS_SCHEDULER_IDLE_WAIT',
  'BROKER_TRANSMITTER', 'KSOURCE_WAKEUP',
  'LOGMGR_QUEUE', 'ONDEMAND_TASK_QUEUE',
  'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BAD_PAGE_PROCESS',
  'DBMIRROR_EVENTS_QUEUE', 'BROKER_RECEIVE_WAITFOR',
  'PREEMPTIVE_OS_GETPROCADDRESS', 'PREEMPTIVE_OS_AUTHENTICATIONOPS',
  'WAITFOR', 'DISPATCHER_QUEUE_SEMAPHORE', 'XE_DISPATCHER_JOIN',
  'RESOURCE_QUEUE' )
ORDER BY wait_time_ms DESC

What other processes are running on the system? Is the OS under load? How many CPU cores does the system have?

EDIT: After the user pasted the output

The output does not give any concrete picture; please use my query. Also, are you running Extended Events traces? I ask because I can see XE wait types. I consider this wait type harmless, and it seems you are not facing an issue. Monitoring tools sometimes overreact, so based on the result you posted I would consider this normal behavior.

EDIT2: I don't find any issue with the wait stats output you posted. I am also assuming your server was not restarted recently; otherwise the wait stats will not be useful.
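Whether the wait stats cover a meaningful period can be checked from the instance start time, as in the sketch below. It does not detect a manual reset: if someone has run DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR), the counters are younger than the start time suggests.

```sql
-- When the instance last started, i.e. the longest period the wait stats can cover.
SELECT sqlserver_start_time,
       DATEDIFF(HOUR, sqlserver_start_time, SYSDATETIME()) AS hours_since_start
FROM sys.dm_os_sys_info;
```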

Sql-server – Resource semaphore query compile waits

RESOURCE_SEMAPHORE_QUERY_COMPILE isn't aided by more memory in older versions of SQL Server. Since you're on 2014, you can use Trace Flag 6498 if you're on certain patch levels.

That TF increases the large query compile gateway dependent on the amount of memory in your server.
