Sql-server – How to debug a growing Redo Queue Size in the Availability Group Dashboard

availability-groupsreplicationsql serversql-server-2016transaction-log

The Redo Queue Size has been growing pretty big over the last couple of hours on my secondary replica's server (SQL16). There's nothing long running or heavy running that I've found over the last hour or so, so I'm not sure where this is coming from.

My index maintenance jobs ran last night which are pretty beefy and normally cause this "lag" in sync between the replicas but that usually clears up by the morning.

I log the Redo Queue Size and can see where it spiked through the night and then started coming down as per usual. The jobs finished hours ago, and the Redo Queue Size spiked to around 10 GB during the job and after the job finished came back down to around 1.5 GB, but now it's growing again even though nothing heavy has been running since.

Note: My secondary replica is setup as a synchronous replica.

Availability Group Dashboard

Best Answer

There is no way to "debug" this other than to open an incident with Microsoft for investigation. They may be able to get a series of process dumps or enable some more verbose logging to see if there is a hung thread, but you may have to leave it in the current condition for quite a while, so that is the downside.

Based on my experiences with the same symptoms, stopping and starting the SQL Server service on the secondary will resolve the issue (unless you have an underlying I/O issue that is causing it). It may take a while for the database to recover as it will have to play all of the transactions in the redo queue, and during this time, replication will be paused. You can track progress of database recovery in the error log, and once the database is recover, replication will resume.