looking for ideas for diagnosing the root cause of the following issue. We have 3 availability nodes Primary,Synchronous replica, Asynchronous replica. no load other than T-Log backups exists on the Synchronous replica. At a certain time all replicas start lagging behind. The lag increases over a period of 6 hours. Data is still being written but very slowly. Suddenly the replicas start catching up and are back up to date within a period of around 30 minutes.
No errors in any of the SQL Server Error logs.
No blocking of the DB STARTUP process or any of the redo processes on either of the replicas (one has no users on it)
HADR_DATABASE_FLOW_CONTROL has moved into the top waits for the Primary
On Secondary top waits are normally
PARALLEL_REDO_WORKER_WAIT
REDO_THREAD_PENDING_WORK
PARALLEL_REDO_TRAN_TURN
During the issue this stayed much the same other than
PARALLEL_REDO_FLOW_CONTROL existed.
No evidence of a long running large transaction on the primary. Given that all replicas were affected it points to a problem with the primary. No index maintenance type jobs running. Can anyone think of any ddl,dml operations or other task that could cause this type of behavior?
thanks
Environment is all SQL2016 CU9, disks are local on all machines
Best Answer
Considering the presence of high
HADR_DATABASE_FLOW_CONTROL
waits on the primary, and the combination ofPARALLEL_REDO_FLOW_CONTROL
andPARALLEL_REDO_TRAN_TURN
waits on the secondary, it sounds like parallel redo might not be a good fit for your workload.Microsoft calls out one example in their blog post on parallel redo:
If your workload results in lots of page splits or forwarded records, then you should consider disabling parallel redo on the secondary by turning on documented trace flag 3459 (note: as a permanent solution, you will want to set this as a startup trace flag in SQL Server Configuration Manager):
If you need to revert back to using parallel redo on your build, you will need to restart the SQL Server service after disabling the trace flag. This was fixed (to not require a restart) in SP1 CU10 and SP2 CU2. It doesn't look like it's fixed in the RTM branch. See this KB for details: https://support.microsoft.com/en-us/help/4339858/fix-parallel-redo-does-not-work-after-you-disable-trace-flag-3459-in-a
More broadly, there are several fixes for issues with parallel redo in SP1 and SP2. You should consider patching SQL Server to see if that resolves your issue as well.
If you want to confirm that page splits or forwarded fetches are actively occurring, try running sp_BlitzFirst:
If you want to confirm that your heaps have lots of forwarded records, you can use a DMV query like this one (note: replace Your_Table_Name with the name of a heap in your database):