Sql-server – Availability Group Replica delay

availability-groupssql server

looking for ideas for diagnosing the root cause of the following issue. We have 3 availability nodes Primary,Synchronous replica, Asynchronous replica. no load other than T-Log backups exists on the Synchronous replica. At a certain time all replicas start lagging behind. The lag increases over a period of 6 hours. Data is still being written but very slowly. Suddenly the replicas start catching up and are back up to date within a period of around 30 minutes.
No errors in any of the SQL Server Error logs.
No blocking of the DB STARTUP process or any of the redo processes on either of the replicas (one has no users on it)
HADR_DATABASE_FLOW_CONTROL has moved into the top waits for the Primary
On Secondary top waits are normally
PARALLEL_REDO_WORKER_WAIT
REDO_THREAD_PENDING_WORK
PARALLEL_REDO_TRAN_TURN

During the issue this stayed much the same other than
PARALLEL_REDO_FLOW_CONTROL existed.

No evidence of a long running large transaction on the primary. Given that all replicas were affected it points to a problem with the primary. No index maintenance type jobs running. Can anyone think of any ddl,dml operations or other task that could cause this type of behavior?
thanks

Environment is all SQL2016 CU9, disks are local on all machines

Best Answer

Considering the presence of high HADR_DATABASE_FLOW_CONTROL waits on the primary, and the combination of PARALLEL_REDO_FLOW_CONTROL and PARALLEL_REDO_TRAN_TURN waits on the secondary, it sounds like parallel redo might not be a good fit for your workload.

Can anyone think of any ddl,dml operations or other task that could cause this type of behavior?

Microsoft calls out one example in their blog post on parallel redo:

PARALLEL_REDO_TRAN_TURN - Only happens in a readable secondary replica when new insert triggers page-split system transaction, or record update in a heap table generates a forwarded record

If your workload results in lots of page splits or forwarded records, then you should consider disabling parallel redo on the secondary by turning on documented trace flag 3459 (note: as a permanent solution, you will want to set this as a startup trace flag in SQL Server Configuration Manager):

DBCC TRACEON (3459,-1);

If you need to revert back to using parallel redo on your build, you will need to restart the SQL Server service after disabling the trace flag. This was fixed (to not require a restart) in SP1 CU10 and SP2 CU2. It doesn't look like it's fixed in the RTM branch. See this KB for details: https://support.microsoft.com/en-us/help/4339858/fix-parallel-redo-does-not-work-after-you-disable-trace-flag-3459-in-a

More broadly, there are several fixes for issues with parallel redo in SP1 and SP2. You should consider patching SQL Server to see if that resolves your issue as well.


If you want to confirm that page splits or forwarded fetches are actively occurring, try running sp_BlitzFirst:

EXEC dbo.sp_BlitzFirst;

If you want to confirm that your heaps have lots of forwarded records, you can use a DMV query like this one (note: replace Your_Table_Name with the name of a heap in your database):

SELECT 
    OBJECT_NAME(object_id) AS table_name, 
    forwarded_record_count,
    page_count
FROM sys.dm_db_index_physical_stats 
(
    DB_ID(), 
    OBJECT_ID(N'Your_Table_Name'), 
    DEFAULT, 
    DEFAULT, 
    'DETAILED'
);