SQL Server – HADR high worker thread usage

availability-groups, hadr, sql server, sql server 2014

Why would the number of worker threads that availability groups take from the HADR pool grow well beyond the documented baseline of "typically, there are 3–10 shared threads" per replica?

In one case we've observed usage of 300+ threads with 3 availability groups and 10 databases total. SQL Server 2014 SP1.

Our current suspects are backups on a secondary replica, high activity on the primary replica, and reporting queries on a secondary replica.

The AGs run in a datacenter on VMware. The instance has 16 schedulers in total, the usual worker thread count is under 200, and the server's max degree of parallelism (MAXDOP) is 2.

  • 3 AGs, 10 databases, 4 replicas each: 1 primary, 2 read-only secondaries, 1 non-readable secondary.
  • 1 secondary is synchronous-commit, 2 are asynchronous-commit.
  • 16 vCPUs on 32 physical cores, in a large multi-host cluster.
  • No overprovisioning.
  • Other, smaller VMs (4-8 cores) are colocated, but they don't put pressure on the CPU.

We observed a spike in worker threads that resulted in a denial of service. Attributing the spike to the AGs is our assumption, since only those worker threads can cross the limit.
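To put the 300+ figure in perspective, here is a rough thread budget for this topology as a Python sketch. It relies on Microsoft's documented defaults as I understand them: on 64-bit SQL Server with 5 to 64 logical CPUs, the default "max worker threads" is 512 + (CPUs - 4) * 16; availability groups can use at most that value minus 40; and a primary replica dedicates 1 log capture thread per primary database plus 1 log send thread per secondary database. The function names are hypothetical, and the example assumes "max worker threads" is left at its default and that all 10 databases are primary on the same instance.

```python
def default_max_worker_threads(logical_cpus: int) -> int:
    """Default 'max worker threads' on 64-bit SQL Server (up to 64 logical CPUs)."""
    if logical_cpus < 1 or logical_cpus > 64:
        raise ValueError("this sketch only covers 1 to 64 logical CPUs")
    if logical_cpus <= 4:
        return 512
    return 512 + (logical_cpus - 4) * 16


def hadr_pool_cap(max_worker_threads: int) -> int:
    """Upper bound on threads available to availability groups."""
    return max_worker_threads - 40


def primary_dedicated_threads(primary_dbs: int,
                              secondaries_per_db: int,
                              running_secondary_backups: int = 0) -> int:
    """Dedicated HADR threads on a primary replica.

    One log capture thread per primary database, one log send thread per
    secondary database, plus one thread held on the primary for each backup
    currently running on a secondary.
    """
    log_capture = primary_dbs
    log_send = primary_dbs * secondaries_per_db
    return log_capture + log_send + running_secondary_backups


# Assumed topology from the question: 16 schedulers, 10 databases,
# 3 secondaries per database, all primaries on one instance.
budget = default_max_worker_threads(16)
cap = hadr_pool_cap(budget)
dedicated = primary_dedicated_threads(primary_dbs=10, secondaries_per_db=3)
print(f"max worker threads: {budget}, HADR cap: {cap}, dedicated: {dedicated}")
```

Under these assumptions the dedicated threads plus the 3–10 shared threads come to roughly 50, so 300+ is far above the steady-state expectation while still below the pool cap.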

The links below, from the SQL Server Premier Field Engineer Blog, read in context, don't give me a complete answer:

Best Answer

Since your datacenter runs on VMs, I suspect you are experiencing poor disk performance. Poor disk performance can slow log writes on the secondary, which delays the secondary replica's acknowledgment back to the primary replica and can end up exhausting worker threads.

Disk latency on a secondary replica can increase HADR sync commit waits (the HADR_SYNC_COMMIT wait type), so the primary holds threads open while it waits for the secondary to acknowledge each transaction.

Please check the error log for deadlocked scheduler messages, and collect I/O metrics from PerfMon to see the disk latency and the disk queue length.
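If PerfMon is not handy, write latency can also be estimated from two snapshots of the cumulative counters that sys.dm_io_virtual_file_stats exposes (num_of_writes and io_stall_write_ms). The Python sketch below only does the arithmetic; the class and function names and the sample numbers are hypothetical, and collecting the snapshots from the log file of each availability database is left to you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileStatsSample:
    """One snapshot of cumulative counters for a single database file."""
    num_of_writes: int
    io_stall_write_ms: int


def avg_write_latency_ms(earlier: FileStatsSample,
                         later: FileStatsSample) -> Optional[float]:
    """Average write latency in ms over the interval between two snapshots.

    Returns None when no writes happened in the interval.
    """
    writes = later.num_of_writes - earlier.num_of_writes
    stall_ms = later.io_stall_write_ms - earlier.io_stall_write_ms
    if writes < 0 or stall_ms < 0:
        # The counters reset when the instance restarts.
        raise ValueError("counters went backwards; was the instance restarted?")
    if writes == 0:
        return None
    return stall_ms / writes


# Made-up sample values, for illustration only.
before = FileStatsSample(num_of_writes=1_000, io_stall_write_ms=5_000)
after = FileStatsSample(num_of_writes=3_000, io_stall_write_ms=65_000)
print(avg_write_latency_ms(before, after))  # 60000 ms / 2000 writes = 30.0
```

Taking the difference between snapshots matters because the counters accumulate from instance startup, so a single reading averages over the whole uptime and hides a recent spike.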