Sql-server – Always ON synchronization issue

availability-groupssql-server-2017

I'm having a bad issue with Always ON on a SQL Server 2017 CU3 running on Windows Server 2016 with last cumulative updates (March 2018).
Availability Mode is set to Synchronous commit and when I reboot the secondary node some DBs (randomly) get in a "Not Synchronized" state and they can't go back to "Synchronized".
Everything looks OK, I have another SQL installation with same patching level and Os and it's working fine.
I don't have specific errors on the event viewer. Antivirus is disabled, the nodes are on the same network and they look ok.
I tried to raise the session timeout to 20s but this didn't solve the issue.
Do you have any idea or suggestion?

Best Answer

I had the same issue on SQL Server 2017 CU 11 (latest as of today) with Windows Server 2016. After rebooting the secondary replica, some of the databases in the AG show not synchronizing. I used ALTER database SET HARD RESUME, but it didn't work. The work around is to restart the SQL Server Services on the secondary replica. Then all databases show synchronized again.

Related Solutions

SQL Server Availability Groups – Mitigating Disconnected Transaction Log Effects

Can someone please confirm if this is correct?

Yes, you are 100% correct!

Is there any way we can mitigate this, ie. ensuring there's no index maintenance during this window or switching to aynschronous commit mode?

The mitigation is not with the mode that the replicas run it, it's removing that replica from the AG.

Does this pertain to my situation as described above and if so does this mean that the potential for the transaction to fill is mitigated by design?

Yes, that's what will happen if you change nothing. No, the transaction log will still fill up. Regardless of synchronous or asynchronous modes the primary must keep all log blocks that are required for the replica that is the furthest behind.

In your scenario, there are only two replicas a primary and a secondary. The secondary will be offline for a while. You've already correctly stated what will happen, the send queue will get larger as the replica is offline and stays that way. This happens no matter the mode.

The mitigation is to remove the replica that will be down for the extended period of time. This allows the primary to properly truncate the log with proper transaction log backups.

Once the secondary is back up, since it was kicked out of the AG all of the databases participating in that AG will be in a restoring state. To get back to a good steady state and bring the replica back into the AG, take the log backups that were run during the period the secondary replica was offline and apply them to the databases, leaving them in a restoring state (with no recovery). Once you get close, pause all transaction log backups on the primary and join the replica back into the AG. Resume the log backups on the primary.

Sql-server – Does SQL Server 2017 CU1 break clusterless availability groups

My division is is doing an upgrade from SQL Server 2008R2 with mirroring to SQL Server 2017 with clusterless availability groups.

So you're upgrading versions but REMOVING high availability and disaster recovery? Clusterless AGs are called "Read-Scale" AGs and do not give high availability and you can argue on the disaster recovery part...

The group could be failed over multiple times without issue. Ah, but add a second database and issues would arise on the failovers. One of the databases would invariably wind up in a not synchronizing state. No amount of fiddling could resurrect it.

I've been seeing this when a configuration only replica (came in CU1) isn't used with Read-Scale AGs that are being used to fail over. Read-Scale wasn't made to fail over and all that jazz, it was made to horizontally scale out read copies for intense read situations (or as a way to replica across Windows/Linux for migrations). I must reiterate, "clusterless" AGs are not made for HADR. If this is part of your use case, use WSFC or Pacemaker (Linux). Info on Configuration Only Replica.

A pertinent error message in the logger was "Failed to update Replica status due to exception 35222." This seems to be a message related to clusters, but since we are clusterless I was confused.

There should be an error directly before this, that's the actual error you want to look into. This doesn't have anything to do with clustering and is not a clustering error, it's a replica error.

After we uninstalled CU1 on both replicas, I was able to create the AG and add 22 databases (including the two original). Failovers were without issue

This goes back to pre-configuration only replicas for adding into Read-Scale replicas. Again, surprised you didn't run into a few different issues as Read-Scale isn't made for HADR.

On a side note, automatic seeding did not always work with multiple databases. The operation would fail with a "Seeding Check Message Timeout".

Seems unrelated, but you never know - could be a side effect of whatever was going on. Impossible to say at this point.

I thought CUs were going to be tested at the same level as SPs.

You're correct. Changes in behavior between CUs (much like also were involved in SPs) can and do happen given the newer model (even in the SP + CU days this happened). I'd be interested to see if the configuration replica solves your issue since it was specifically added in CU1 for metadata safety as metadata issues for the replicas can and did happen since Read-Scale again wasn't made for HADR.

Best Answer

Related Solutions

SQL Server Availability Groups – Mitigating Disconnected Transaction Log Effects

Sql-server – Does SQL Server 2017 CU1 break clusterless availability groups

Related Question