Diagnosing Slow Always On Commits in SQL Server 2016

Two node Always On availability group, one synchronous replica.

My synchronous replica falls out of sync very regularly. I see a pattern in which, when a log backup occurs on the secondary replica a short period of lag will occur during which the redo_queue_size fills up quickly, like so:

Looking at the guidance in the following link, it seems that my problem is mostly due to contention the redo thread experiences when trying to harden the transactions:

https://technet.microsoft.com/en-us/library/dn135335(v=sql.110).aspx

The replica falls further out of sync when a transaction log backup runs and this problem is exacerbated by reports running on the secondary replica also.

All the while, my transaction log backups are huge – on average 1.2GB but can be larger.

As far as I am aware, my log backups will be large because I have TDE enabled on the database, but I really didn't expect them to be this large. I suspect this is what is contributing the most to the slow commits on the secondary replica.

Are there any performance counters recommended to diagnose slow commits on a synchronous replica? What else can I do to validate my theory?

My problem seems to be the same as the one described here:
https://www.sqlservercentral.com/Forums/1871286/AlwaysOn-Missing-Redo-Thread

Can I just enable this trace flag on the secondary replica or does it need applying to both nodes?

EDIT: I checked the redo queue at 6am and found a huge number, with a recovery time of 15-20 minutes and increasing slightly all the time. I then applied the traceflag with DBCC TRACEON (3459, -1) and found after a few minutes, the number of redo commands dropped extremely quickly. So far the problem seems to have been mitigated by this traceflag but presumably this will put all transactions hardened to the secondary replica's log in single-threaded mode, like SQL Server 2014 and therefore, there is still the potential for the secondary replica to fall behind as a result of the non-parallel threading, when the primary is under heavy write load.

Best Answer

The problems I was experiencing:

Synchronous secondary replica falling behind semi-permanently
Huge log backups

These were solved by enabling trace flag 3459. In my case it was extremely easy to see that the flag immediately fixed the wait type of parallel_redo_flow_control dirty_page_table_lock parallel_drain_redo_worker and showed a rapid decrease in size of the redo queue.

I wonder why, in the bug report, Microsoft calls this "assertion": https://support.microsoft.com/en-us/help/3200975/fix-assertion-occurs-when-you-use-parallel-redo-in-a-secondary-replica

Credit to Jason AKA CirqueDeSQLeil from SQLServerCentral.com https://www.sqlservercentral.com/Forums/1871286/AlwaysOn-Missing-Redo-Thread

Best Answer

Related Solutions

Sql-server – Recovery time effects of Availability Groups

SQL Server – Always On DDL Operations Explained

Related Question