SQL Server 2016 – Max Worker Thread Exceeded After Manual AG Failover

availability-groups | sql-server-2016 | troubleshooting

We have two SQL Server 2016 SP2 CU6 Enterprise instances running on physical hardware (32 cores) with one AG that contains 5 databases. The secondary is not readable. Today we did a planned manual failover so we could do maintenance work on the other server. We did the failover at a moment when there was no heavy load.

The failover was executed through the SSMS GUI and gave no errors, but several minutes later we got a lot of phone calls that users were unable to log in.
Our first troubleshooting step was to connect with SSMS, but that failed with a connection error; I don't remember the exact message.

Next we took a look at the errorlog file in Notepad, and it looked as if the SQL Server engine had hung: no new entries had been added to the log after the failover.

Due to the high urgency of the problem, we restarted the SQL Server service on the primary and the problem disappeared.
We could have tried making a DAC (Dedicated Administrator Connection) to see what the problem was, but the objective was to bring the server back online as soon as possible.
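
For completeness, this is roughly what that would have looked like; the server name below is just a placeholder and the query is only a sketch of a first check we could have run over the DAC:

-- Open the DAC by prefixing the server name, e.g. ADMIN:OurServer in SSMS or sqlcmd -A.
-- Over the DAC, check scheduler and worker pressure without needing a regular worker:
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,
       current_workers_count,
       work_queue_count        -- tasks waiting for a worker thread
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';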

After everything was back online, we started analyzing the log files.
In the errorlog on the old primary we found several occurrences of error 35278. No long-running transactions were active at the time of the failover.
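
(For anyone who wants to check what error 35278 means on their own build, the message text can be looked up in sys.messages; this is just a generic lookup, nothing specific to our setup.)

-- Look up the message text for error 35278 (1033 = English):
SELECT message_id, severity, text
FROM sys.messages
WHERE message_id = 35278
  AND language_id = 1033;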

Next was the AlwaysOn_health event file; its entries also just stopped after the failover.

Next we took a look at the *_*_SQLDIAG_*_*.xel file. Here we were lucky.
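
Instead of the SSMS viewer, the file can also be read with T-SQL; the path below is a placeholder for our LOG directory, so treat this as a sketch:

-- Read the SQLDIAG extended event file; adjust the path to your instance's LOG folder.
SELECT CAST(event_data AS xml) AS event_data_xml,
       file_name,
       file_offset
FROM sys.fn_xe_file_target_read_file(
       N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Log\*_SQLDIAG_*.xel',
       NULL, NULL, NULL);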

The first thing we noticed was:

<queryProcessing maxWorkers="960" workersCreated="1064"
 workersIdle="23" tasksCompletedWithinInterval="19" pendingTasks="34"
 oldestPendingTaskWaitingTime="1316776"

For some reason the max number of workers was exceeded, which could explain why no one could connect.
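
Had we been able to connect (or next time, over the DAC), the same numbers could be pulled live with something like this (a sketch):

-- Compare the worker thread ceiling with the workers actually created
-- and the tasks still queued waiting for one.
SELECT (SELECT max_workers_count FROM sys.dm_os_sys_info)        AS max_workers,
       (SELECT COUNT(*) FROM sys.dm_os_workers)                  AS workers_created,
       (SELECT SUM(work_queue_count) FROM sys.dm_os_schedulers)  AS tasks_waiting_for_a_worker;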

These were the pending tasks:

<pendingTasks>
   <entryPoint name="Process Command" count="14" />
   <entryPoint name="SNI New Connection" count="2" />
   <entryPoint name="SNI Accept Done" count="3" />
   <entryPoint moduleName="sqlmin.dll" imageBase="0x7ff827e80000" size="0x251e000" address="0x7ff8286622e0" count="6" />
   <entryPoint moduleName="sqlmin.dll" imageBase="0x7ff827e80000" size="0x251e000" address="0x7ff8292b1190" count="2" />
   <entryPoint moduleName="sqldk.dll" imageBase="0x7ff827120000" size="0x4c9000" address="0x7ff827125e60" count="7" />
</pendingTasks>

There were also several blocked-process entries in that file. All of them had a waitresource like DATABASE: 5:5, DATABASE: 5:15, DATABASE: 5:26, …

All of those processes had lastbatchstarted and lastbatchcompleted set to 1900-01-01, and their waittime was roughly equal to the time that had passed since the failover.
In our monitoring we saw a total of 950 blocked processes.
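
In hindsight, a query along these lines (over the DAC) would have shown the same blocking directly from the DMVs; this is only a sketch:

-- List blocked requests and the session each one is waiting on.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.wait_resource,
       r.status
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;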

This is a sample of a blocked-process-report:

<blocked-process-report monitorLoop="0">
 <blocked-process>
  <process id="process13f08032ca8" taskpriority="0" logused="10000" waitresource="DATABASE: 5:3 " waittime="1334205" schedulerid="4" kpid="11752" status="suspended" spid="95" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="1900-01-01T00:00:00" lastbatchcompleted="1900-01-01T00:00:00" lastattention="1900-01-01T00:00:00" clientapp="our_app" hostname="host2021" hostpid="14196" loginname="user1" isolationlevel="read committed (2)" xactid="0" currentdb="1" currentdbname="master" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack />
   <inputbuf>
   </inputbuf>
  </process>
 </blocked-process>
 <blocking-process>
  <process status="suspended" waitresource="DATABASE: 5:3 " waittime="1335893" spid="70" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="1900-01-01T00:00:00" lastbatchcompleted="1900-01-01T00:00:00" lastattention="1900-01-01T00:00:00" clientapp="our_app" hostname="host1" hostpid="1324" loginname="user1" isolationlevel="read committed (2)" xactid="0" currentdb="1" currentdbname="master" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack />
   <inputbuf>
   </inputbuf>
  </process>
 </blocking-process>
</blocked-process-report>
<blocked-process-report monitorLoop="0">
 <blocked-process>
  <process id="process13f08033848" taskpriority="0" logused="10000" waitresource="DATABASE: 5:3 " waittime="1335893" schedulerid="4" kpid="12004" status="suspended" spid="70" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="1900-01-01T00:00:00" lastbatchcompleted="1900-01-01T00:00:00" lastattention="1900-01-01T00:00:00" clientapp="our_app" hostname="host1" hostpid="1324" loginname="user1" isolationlevel="read committed (2)" xactid="0" currentdb="1" currentdbname="master" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack />
   <inputbuf>
   </inputbuf>
  </process>
 </blocked-process>
 <blocking-process>
  <process status="background" waittime="1139955" spid="141" sbid="0" ecid="0" priority="0" trancount="0">
   <executionStack />
   <inputbuf>
   </inputbuf>
  </process>
 </blocking-process>
</blocked-process-report>

One of the blocking processes was executed as user SA and had the command UNKNOWN TOKEN. The last_wait_type of this session was PARALLEL_REDO_WORKER_WAIT_WORK, and its last_request_start_time was equal to the time we did the failover.
I didn't expect to find this wait type on the primary.
Is it correct to say that PARALLEL_REDO_WORKER_WAIT_WORK shouldn't be expected on a primary?
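
For reference, the equivalent of what our monitoring showed can be pulled straight from the DMVs; this is a sketch, not our exact monitoring query:

-- Find requests whose last wait was the parallel redo worker wait,
-- together with the owning session's details.
SELECT s.session_id,
       s.login_name,
       s.status,
       s.last_request_start_time,
       r.command,
       r.last_wait_type
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_requests AS r
  ON r.session_id = s.session_id
WHERE r.last_wait_type = N'PARALLEL_REDO_WORKER_WAIT_WORK';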

Also in our monitoring we found approximately 1000 sessions that had the status SLEEPING and a last_request_start_time of 01/01/1900.
That doesn't seem normal at all.
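
That observation came from our monitoring tool; a plain DMV version would be something like this (sketch):

-- Sessions that are sleeping and have never started a request on this instance
-- (last_request_start_time still at the default of 1900-01-01).
SELECT session_id,
       login_name,
       host_name,
       program_name,
       status,
       last_request_start_time
FROM sys.dm_exec_sessions
WHERE status = N'sleeping'
  AND last_request_start_time = '19000101';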

Can someone explain what those pending tasks are?
My guess at the root cause is that the roughly 1000 SLEEPING sessions each held a worker.
That caused the pending tasks, and because of the blocking no workers could do any work, which in turn caused the failover to get stuck somewhere.
Can this be correct?

Best Answer

Due to the high urgency of the problem, we restarted the SQL Server service on the primary and the problem disappeared.

Since no investigation or gathering of data was completed, it's going to be hard to say exactly what was going on at the time.

For some reason the max number of workers was exceeded, which could explain why no one could connect.

Yes, for some reason, and we don't have any data around that reason :( It looks like the server has 32 processors, which should easily be enough to handle 5 databases in an availability group. There is more to this that we either don't know or you aren't explaining.
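
For what it's worth, the maxWorkers="960" in your SQLDIAG output lines up with the default formula for 64-bit with 32 logical CPUs: 512 + ((32 - 4) * 16) = 960. You can confirm what an instance is actually using with something like this:

-- Default max worker threads on 64-bit with more than 4 logical CPUs:
--   512 + ((logical CPU count - 4) * 16)  =>  512 + ((32 - 4) * 16) = 960
SELECT c.value_in_use      AS configured_max_worker_threads,   -- 0 means "use the default"
       i.cpu_count,
       i.max_workers_count AS effective_max_worker_threads
FROM sys.configurations AS c
CROSS JOIN sys.dm_os_sys_info AS i
WHERE c.name = N'max worker threads';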

Is it correct to say that PARALLEL_REDO_WORKER_WAIT_WORK shouldn't be expected on a primary?

Since we don't know the database context for this, it's impossible to say. I will say that if there is only a single availability group and this instance was the primary, then no, it would not be expected. If, however, there are other availability groups that we're not aware of and this instance is a secondary for one of them, then yes, it is expected. Again, we don't have all the information. This also assumes that there was no redo queue that needed to be drained on any of the databases, which, again, we don't know.
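
If you want to rule the redo queue in or out next time, the replica DMV will show it; something along these lines (just a sketch):

-- Outstanding redo per database on this replica (redo_queue_size is in KB).
SELECT DB_NAME(drs.database_id)        AS database_name,
       drs.is_primary_replica,
       drs.synchronization_state_desc,
       drs.redo_queue_size,
       drs.redo_rate
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;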

Can someone explain what those pending tasks are?

The Process Command entries are tasks attempting to execute something on a connection. The SNI* entries are new connections being made.

There was also a blocking-process; after checking our monitoring, this process was executed as user SA and had the command UNKNOWN TOKEN.

How many sessions was it blocking? What was it blocking? etc. Again, lack of information so can't say if it's something or not.

Also in our monitoring we found approximately 1000 sessions that had the status SLEEPING and a last_request_start_time of 01/01/1900. That doesn't seem normal at all.

Where were they coming from? What did they last run? It might be normal if you have a secondary replica that is readable and someone wanted to run a bunch of reports or queries. Again, need more information.

Can this be correct?

It could be. Obviously you ran out of worker threads; how and why that happened is unknown, as there just isn't enough data.

In our monitoring we saw a total of 950 blocked processes.

There's your underlying problem. It looks like clientapp="our_app" was connected (it's all in master) and was doing some type of querying of the database. It seems the database wasn't yet done with redo (hence the wait type) and all of your sessions backed up behind each other.
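
If this happens again, before restarting it would be worth grabbing (over the DAC) who actually holds the DATABASE-level lock those sessions are queued behind; a sketch, with database_id 5 taken from your blocked-process reports:

-- Sessions holding or waiting on DATABASE-level locks in database 5
-- (5 comes from the waitresource "DATABASE: 5:..." in the reports).
SELECT tl.request_session_id,
       tl.resource_type,
       tl.resource_database_id,
       tl.request_mode,
       tl.request_status      -- GRANT = holder, WAIT = waiter
FROM sys.dm_tran_locks AS tl
WHERE tl.resource_type = N'DATABASE'
  AND tl.resource_database_id = 5
ORDER BY tl.request_status;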