SQL Server Restarted with Pending Task Count Increased – Troubleshooting

performancesql serversql-server-2012

We are trying to understand the issue where SQL server was suddenly restarted on its own-

Its a SQL server 2012 SP4 -GDR applied with 40 CPu's HT enabled , thus total of 80 logical processors-

MAXDOP=8 , CTOP 5

From error logs we found

/**********************/

BEGIN stack Dump

Non- Yielding Scheduler /**********************/

Error messages just at the time of restart of SQL. Also there was no mini dump created as checked.

Yes we have quite some queries running at that time. Top 3 waits seen were:-

1.TranLogIO
2.CXPACKET
3.PAGELATCH SH

However we also noticed wait called SOS WORKER as collected from system health XE which i believe is nothing else but THREADPOOL. Therefore i went further and analyse the query processing details from that system health and found at the time of non yielding scheduler errors below was logged

At 19:46 maxworkers– 2944 workers created 789 oldest pending taskwait time 0 pending task 4

At 19:51 maxworkers– 2944 workers created 982 oldest pending taskwait time 256987 pending task 165

At 19:51 we started seeing restart happened.

Question is why there would be a threadpool wait if almost 2000 workers are still available from above calculation. Also why those pending task count 165 when queries have so many schedulers available to run and complete the request for bunch of queries waiting on CXPACKETS?

Edit: Updating my question for couple of waits also seen from XE

SOS_MEMORY_TOPLEVELBLOCKALLOCATOR

As i am reading info on above wait here https://www.sqlskills.com/help/waits/sos_memory_toplevelblockallocator/

Based on the blog there is a fix in SP3 CU but with TF T8075 : I am currently on SP4-GDR latest patch, do i still need to apply the TF, though i do not see any messages like Failed allocate pages: FAIL_PAGE_ALLOCATION 513 in error log

Not sure if it helps- i see total server memory dropping a gig now and then and then matching upto target server memory which is = max server memory (750GB)

Most of the times total=target=max server memory. The drops are just by a gig and not much

Memory info= total ram: 880 GB Max memory : 750GB Min server memory: 130 GB
Its a 2 node windows cluster and no other SQL instance is shared. Resource governor is not enabled

Thanks

Best Answer

I would start by saying that best person to analyze dump in MS guy or person who knows about it. I would just try to point out some basics from the log you posted

At 19:46 maxworkers- 2944 workers created 789 oldest pending taskwait time 0 pending task 4 At 19:51 maxworkers- 2944 workers created 982 oldest pending taskwait time 256987 pending task 165

Please see that taskwait time 256987 and pending task 165. This means the scheduler was hung and 165 tasks were waiting on it to get a scheduler and run. In this case MS SQL Server was not able to get out of this hung scheduler scenario and waited for a while but again it decided that it would be best to restart itself to come out of this hung scheduler scenario and hence it restarted. Now for reason why it hang is beyond my ability to tell with the information you have posted.

Also note a thread is assigned a scheduler and it should run on that scheduler and that is why so many threads were waiting on this hung scheduler.