SQL Server – Availability Group database stays in Reverting for too long after a failover

availability-groups, sql-server

Architecture:
I have a 2-node synchronous-commit AlwaysOn configuration running on a multi-subnet failover cluster. The primary node is in Europe and the secondary node is in the US. I have only one database in the Availability Group, which is the OperationsManager database of SCOM.

  • Primary and Secondary Hosts are identical.
  • SQL Server version on both boxes: 13.0.5237, running on Windows Server Core.
  • Update: I patched both servers to 13.0.5270.0; it didn't help.
  • DB VLF count is only 27.

Problem:
When I initiate a failover, the database fails over from the primary to the secondary node successfully in seconds. However, the new secondary (old primary) database goes into the Reverting / In Recovery phase and stays there for approximately 30 minutes. I experienced the same thing while failing back to the original primary box, so the problem occurs in both directions.

Findings:
I searched the internet and read the documentation to investigate the issue. When the role change from Primary to Secondary is finished, the new secondary database goes through three phases:

Synchronization State: “NOT SYNCHRONIZING” ; Database State: ONLINE

Synchronization State: “NOT SYNCHRONIZING” ; Database State: RECOVERING

Synchronization State: “REVERTING” ; Database State: RECOVERING
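
For reference, one way to watch these transitions from T-SQL is to poll the AG DMVs on the new secondary; this is just a sketch using sys.dm_hadr_database_replica_states and sys.databases.

-- Current synchronization and database state for local AG databases.
SELECT DB_NAME(drs.database_id)        AS database_name,
       drs.synchronization_state_desc,                    -- NOT SYNCHRONIZING / REVERTING / SYNCHRONIZED ...
       d.state_desc                    AS database_state  -- ONLINE / RECOVERING ...
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.databases AS d
    ON d.database_id = drs.database_id
WHERE drs.is_local = 1;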

In my case, all the time was spent on the last step (REVERTING). I also monitored the undo process by watching the perfmon counter "SQLServer:Database Replica - Log remaining for undo".
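
The same counter can also be read from T-SQL through sys.dm_os_performance_counters; a sketch is below (on a named instance the object_name prefix is MSSQL$<instance> rather than SQLServer, and the values are reported in KB).

-- Undo progress counters for the AG database on the new secondary.
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Database Replica%'
  AND counter_name IN ('Log remaining for undo', 'Total Log requiring undo')
  AND instance_name = 'OperationsManager';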

I checked the primary before the failover tests to spot any long-running or open transactions but couldn't find any. After the failover, "Log remaining for undo" was around 30 MB, yet it took 30 minutes for the secondary database to go back to the "Synchronized" state. Considering that we are running in sync-commit mode and there is very little workload on the primary, it shouldn't take 30 minutes for the undo phase, in my opinion.
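
For reference, that pre-failover check can be done with something like the following; DBCC OPENTRAN only reports the oldest active transaction, so the DMV query is the broader check (OperationsManager is simply the AG database in this setup).

-- Oldest active transaction in the AG database (run on the primary).
DBCC OPENTRAN ('OperationsManager');

-- All currently open transactions with their start times.
SELECT s.session_id,
       t.transaction_id,
       t.name,
       t.transaction_begin_time
FROM sys.dm_tran_active_transactions AS t
JOIN sys.dm_tran_session_transactions AS st
    ON st.transaction_id = t.transaction_id
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = st.session_id
ORDER BY t.transaction_begin_time;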

SQL Server Error Log: I found these strange messages.

  • Remote harden of transaction 'RECEIVE MSG' (ID 0x000000004d52c65a 0001:01c4e415) started at Feb 22 2019 2:55PM in database 'OperationsManager' at LSN (2558:107841:1) failed.

  • Remote harden of transaction 'GhostCleanupTask' (ID 0x000000004d6d15aa 0001:01c4eaa0) started at Feb 22 2019 2:59PM in database 'OperationsManager' at LSN (2558:107843:46) failed.
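
These entries can be pulled straight out of the error log with xp_readerrorlog (undocumented but widely used); a minimal sketch:

-- Search the current SQL Server error log (0 = current log, 1 = SQL error log) for the messages above.
EXEC xp_readerrorlog 0, 1, N'Remote harden', N'OperationsManager';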

Failover starts:

[screenshots]

Failover ends:

[screenshot]

All in all

Have you ever seen this problem before? Do you have any recommendations?

Best Answer

One thing to check whenever database recovery is running long, whether it's a normal restore or an AG failover, is your VLF count. Having lots of VLFs (thousands or tens of thousands), or VLFs of unusual size (one or two extremely large VLFs) will cause this process to slow to a crawl.

Run the following command against the database in question:

USE YourDatabaseName;
GO

DBCC LOGINFO;

Note: if you're on SQL Server 2016 SP2 or newer, you can use this dynamic management function instead of the DBCC command: sys.dm_db_log_info
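
For example, the DMF equivalent might look like the following sketch (substitute your own database name); it returns one row per VLF, with vlf_size_mb in place of the FileSize column.

-- One row per VLF in the transaction log, with its size in MB.
SELECT file_id, vlf_size_mb, vlf_active
FROM sys.dm_db_log_info(DB_ID(N'YourDatabaseName'));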

The number of rows that comes back is the number of VLFs you have. If that number is very large, or if the FileSize column shows extreme outliers among your VLFs, then you can likely resolve the slow recovery problem by (at a high level):

  1. shrinking the log file as small as possible
  2. growing it back out to its target size
  3. making sure autogrowth is set to a reasonable size based on your typical log growth rate and the frequency of your transaction log backups (a rough sketch of these steps follows below)
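
A rough sketch of those steps is below. The logical log file name (YourDatabaseName_log) and the 8 GB target size are placeholders; check sys.database_files for the real logical name, take a transaction log backup first so the shrink can actually reclaim space, and pick sizes that fit your environment.

USE YourDatabaseName;
GO

-- Step 1: shrink the log file as small as possible (target size is in MB).
DBCC SHRINKFILE (N'YourDatabaseName_log', 1);
GO

-- Steps 2 and 3: grow the log back out to its target size and set a sensible fixed autogrowth.
ALTER DATABASE YourDatabaseName
    MODIFY FILE (NAME = N'YourDatabaseName_log', SIZE = 8GB, FILEGROWTH = 1GB);
GO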

The details of fixing VLF sizing issues have been covered extensively elsewhere; here is one example: A Busy/Accidental DBA’s Guide to Managing VLFs