SQL Server Distributed Availability Group Databases Not Syncing After Reboot

availability-groups distributed-availability-groups sql-server sql-server-2017 upgrade

We're getting ready to perform a large upgrade on our SQL Servers and are noticing some unusual behavior with Distributed Availability Groups that I'm trying to resolve before moving forward.

Last month, I upgraded a remote secondary server from SQL Server 2016 to SQL Server 2017. This server is a part of multiple Distributed Availability Groups (DAGs) and a separate Availability Group (AG). When we upgraded this server, we were unaware that it would get into an unreadable state, so during the past month we've solely been relying on the primary server.

As a part of the upcoming upgrade, I applied the CU 4 patch to the server and rebooted it. When the server came back online, the just-patched secondary showed all of the DAGs/AGs were syncing without any issues.

However, the primary was showing a very different story. It was reporting that:

- the separate AG was syncing without any issues
- but the DAGs were in a Not Synchronizing / Not Healthy state

After initially panicking, I tried the following to get the DAGs synchronizing again:

- From the primary, I stopped and resumed the data movement. This did not start syncing the data.
- On the secondary (the one I just patched), I ran ALTER DATABASE [<database>] SET HADR RESUME; which executed without errors but did not resume any syncing.

My last attempt at syncing the data again was to log in to the secondary and manually restart the SQL Server service. Manually restarting the service seems a bit extreme, as I'd expect the server reboot to have been enough.

Has anyone run into this issue where a DAG doesn't start syncing to a secondary after a reboot? If so, how was it resolved?

I checked both the SQL Server error log and the Event Viewer on the secondary server; there was nothing out of the ordinary that I could see.

Best Answer

Please note, this is not a definitive answer but it's the best answer after chatting with Taryn.

However, the primary was showing a very different story. It was reporting that the separate AG was syncing without any issues but the DAGs were in a Not Synchronizing / Not Healthy state

If the individual databases and AGs underlying the distributed AG say they are healthy and synchronizing, there is a good chance this is just a hiccup in the DMVs and/or SSMS dashboards, since there was nothing in the error log to suggest the replica didn't connect or was in a disconnected state.

Unfortunately since the issue has resolved, it's hard to say exactly what it was... but in the future if this occurs for someone:

- Check sys.dm_hadr_database_replica_states on all clusters, looking for anything that isn't healthy. If everything shows healthy, it's possible the DMV on the primary hasn't updated yet.
- If it's unhealthy, check the error log/DMVs for connectivity issues (such as not being able to connect to the forwarder/global primary).
- Dan's answer mentions issues that could arise from database startup. In this case the instance can't be read, so that most likely wasn't the issue, but it could be in your case.
- If the database is readable, smoke test with a dummy table/insert, or ...
- Use an Extended Events session with the DEBUG channel events sqlserver.hadr_dump_log_block or sqlserver.hadr_apply_log_block to see if the secondary is actually receiving/applying the log blocks, or ...
- Watch the Perfmon counter SQLServer:Database Replica\Log Bytes Received/sec.
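The first check in the list above could be sketched in T-SQL roughly as follows. This is only a starting point, assuming SQL Server 2016 or later (for the is_distributed column); the selected columns are an illustrative choice, and the query needs to be run on each instance involved (global primary, forwarder, and secondaries) so the results can be compared.

```sql
-- Sketch: show the state of every availability database known to this instance,
-- listing distributed AGs first. Compare the output across all instances.
SELECT
    ag.name                         AS ag_name,
    ag.is_distributed,
    ar.replica_server_name,
    DB_NAME(drs.database_id)        AS database_name,
    drs.is_local,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.is_suspended,
    drs.suspend_reason_desc,
    drs.last_received_time,
    drs.last_hardened_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_groups AS ag
    ON ag.group_id = drs.group_id
LEFT JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
ORDER BY ag.is_distributed DESC, ag.name, database_name;
```

Rows where is_suspended = 1, or where synchronization_health_desc is anything other than HEALTHY, are the ones worth following up in the error log.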

If you're receiving data on that secondary but the distributed AG still shows Not Synchronizing or Not Healthy, then I'd let it go for a bit to see if the DMV values change, since it's obviously receiving and processing log blocks.
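If Perfmon isn't handy, the same counter can be read from T-SQL on the secondary. The sketch below assumes the counter is exposed through sys.dm_os_performance_counters, where "per second" counters are stored as cumulative values, so it takes two samples and reports the difference. The 10-second interval is an arbitrary choice.

```sql
-- Sketch: sample the cumulative "Log Bytes Received/sec" counter twice on the
-- secondary and report how many bytes arrived per database in between.
DECLARE @sample TABLE (instance_name nvarchar(128), cntr_value bigint);

INSERT INTO @sample (instance_name, cntr_value)
SELECT RTRIM(instance_name), cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%:Database Replica%'       -- matches default and named instances
  AND counter_name LIKE N'Log Bytes Received/sec%';

WAITFOR DELAY '00:00:10';                            -- arbitrary sampling interval

SELECT
    RTRIM(pc.instance_name)          AS database_name,
    pc.cntr_value - s.cntr_value     AS log_bytes_received_in_interval
FROM sys.dm_os_performance_counters AS pc
JOIN @sample AS s
    ON s.instance_name = RTRIM(pc.instance_name)
WHERE pc.object_name LIKE N'%:Database Replica%'
  AND pc.counter_name LIKE N'Log Bytes Received/sec%'
ORDER BY database_name;
```

A non-zero difference while the primary is generating log suggests the secondary is receiving log blocks even if the dashboard disagrees. A zero difference is only meaningful if there was write activity on the primary during the interval.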

If, however, it isn't receiving data, then further investigation is needed, which is out of scope for this answer.

Related Solutions

SQL Server AlwaysOn – Fixing Not Synchronizing/In Recovery Mode After Upgrade

After poking around in SSMS for a while I noticed that on the secondary replica there was a pause icon next to the Availability Databases. The primary had shown both were "green", but there was an option on the secondary to Resume Data Movement. I resumed the first database, and immediately the In Recovery status message was removed. A minute later it changed from Not Synchronizing to Synchronized, and everything worked as expected.

Here is a screenshot of the AG Databases after I fixed "Patch", but before fixing the test database:

Note that you can also use T-SQL on the secondary to resume data movement on multiple databases at the same time:

ALTER DATABASE [Patch] SET HADR RESUME;
ALTER DATABASE [test] SET HADR RESUME;
GO

SQL Server 2017 – Understanding SQL Availability Groups

How far will/can the secondary replica be behind?

That depends on a number of factors: network speed between the two nodes, disk speed on the secondary, and the volume of data being transmitted. Many of these KPIs are in the AG dashboard (right-click the AG in SSMS and select 'Show Dashboard'). You can add a slew of different metrics, including log send queue, redo queue, estimated recovery time, estimated data loss, etc. As stated in the comments below from scsimon, when in asynchronous-commit mode, the secondary isn't ever truly 'caught up'. It will always show a 'Synchronizing' status and not 'Synchronized'.
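The same lag figures can be pulled without the dashboard. The following is a minimal sketch against sys.dm_hadr_database_replica_states, run on the primary; the column selection is illustrative rather than exhaustive.

```sql
-- Sketch: approximate how far each secondary database is behind.
-- Both queue sizes are reported in kilobytes.
SELECT
    ag.name                         AS ag_name,
    ar.replica_server_name,
    DB_NAME(drs.database_id)        AS database_name,
    drs.synchronization_state_desc,
    drs.log_send_queue_size,        -- KB of log not yet sent to the secondary
    drs.redo_queue_size,            -- KB of log received but not yet redone
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_groups AS ag
    ON ag.group_id = drs.group_id
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
WHERE drs.is_local = 0              -- remote (secondary) rows as seen from the primary
ORDER BY ag.name, ar.replica_server_name, database_name;
```

A log_send_queue_size that keeps growing points at the network or a disconnected replica, while a growing redo_queue_size points at the secondary's ability to apply the log.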

If the secondary replica goes offline for any reason will it catch completely up once it comes online?

Yes. It's also worth noting that while your secondary replica is unavailable/disconnected, log backups on the primary will not truncate any log records that haven't yet been sent to the secondary. In sys.databases, the column log_reuse_wait_desc will be populated with 'AVAILABILITY_REPLICA', as the log cannot be cleared until those records have been hardened on the secondary.
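A quick way to see which databases are affected is to filter sys.databases on that column from the primary. Note that the value is spelled with an underscore. This is a minimal sketch; the extra recovery_model_desc column is included only for context.

```sql
-- Sketch: list databases whose log cannot be truncated because a secondary
-- replica has not yet received or hardened the log records.
SELECT
    name,
    recovery_model_desc,
    log_reuse_wait_desc
FROM sys.databases
WHERE log_reuse_wait_desc = N'AVAILABILITY_REPLICA';
```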

How long can the secondary be down before it will not catch up?

The secondary won't just 'not catch up'. If you take the secondary down (without removing it from the AG), log backups won't truncate the log, the log files will fill up, and eventually you'll run out of room, causing any further transactions to fail. How long the secondary can stay down depends largely on how much space you have and what your transactional volume is. Once the secondary comes back up, it will begin applying the transactions that occurred while it was down. I've seen this take from a few minutes to several hours.

For brief outages/patching, leaving the secondary connected is fine, but if you're looking at a long duration outage, it may be easier to just remove the secondary replica and not worry about the hassle of log files filling up. Then you just re-initialize the DBs into the AG once the outage completes.
