SQL Server – Does SQL Server 2017 CU1 break clusterless availability groups?

availability-groups sql-server-2017

Background:

My division is upgrading from SQL Server 2008 R2 with mirroring to SQL Server 2017 with clusterless availability groups. Until recently, testing had surfaced no issues or red flags. Then we installed CU1, had issues, uninstalled CU1, and the issues went away. The OS is Windows Server 2016 with the latest patches.

Observed behavior after CU1:

Using either SSMS or T-SQL, we could create a two-replica clusterless synchronous availability group and add one database to it. The group could be failed over multiple times without issue. Ah, but add a second database and issues would arise on the failovers. One of the databases would invariably wind up in a not synchronizing state. No amount of fiddling could resurrect it. If I dropped and recreated the whole thing, it might be the other database that went to not synchronizing. A pertinent message in the error log was "Failed to update Replica status due to exception 35222." This seems to be a message related to clusters, but since we are clusterless I was confused.

After we uninstalled CU1 on both replicas, I was able to create the AG and add 22 databases (including the two original ones). Failovers were without issue.

On a side note, automatic seeding did not always work with multiple databases. The operation would fail with a "Seeding Check Message Timeout". Dropping those databases from the AG and adding them one at a time was successful.

My question is:

Has anyone else experienced issues with clusterless AGs after CU1? If so, were you successful where I was not?

Comment/opinion:

I thought CUs were going to be tested at the same level as SPs. While I know that bugs creep in no matter how thorough the testing, having this happen on the first one is troubling. It will cause us to really stress test each CU before deploying, which means we will not deploy them as they come out. We will deploy them only when we think there is a need to. We are a small organization without a dedicated DBA, and we need to be selective about what we take on.

Best Answer

My division is doing an upgrade from SQL Server 2008R2 with mirroring to SQL Server 2017 with clusterless availability groups.

So you're upgrading versions but REMOVING high availability and disaster recovery? Clusterless AGs are called "Read-Scale" AGs and do not give high availability and you can argue on the disaster recovery part...

The group could be failed over multiple times without issue. Ah, but add a second database and issues would arise on the failovers. One of the databases would invariably wind up in a not synchronizing state. No amount of fiddling could resurrect it.

I've been seeing this when a configuration-only replica (introduced in CU1) isn't used with Read-Scale AGs that are being failed over. Read-Scale wasn't made to fail over and all that jazz; it was made to horizontally scale out read copies for read-intensive workloads (or as a way to replicate across Windows/Linux for migrations). I must reiterate: "clusterless" AGs are not made for HADR. If HADR is part of your use case, use WSFC or Pacemaker (Linux). Info on Configuration Only Replica.
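To pin down which database lands in the not synchronizing state after each failover, a query along these lines can be run on either replica. It is only a sketch built on the standard AG catalog views and DMVs; nothing in it is specific to the poster's environment.

```sql
-- Per-database synchronization state for every AG visible on this instance.
-- On the primary this covers all replicas; on a secondary, only the local one.
SELECT
    ag.name                          AS ag_name,
    ar.replica_server_name,
    DB_NAME(drs.database_id)         AS database_name,
    drs.is_primary_replica,
    drs.synchronization_state_desc,  -- e.g. SYNCHRONIZED, NOT SYNCHRONIZING
    drs.synchronization_health_desc,
    drs.is_suspended,
    drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = drs.group_id
ORDER BY ag.name, ar.replica_server_name, database_name;
```

Capturing this output before and after each failover makes it easier to tell whether the affected database is suspended or simply never reconnects.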

A pertinent error message in the logger was "Failed to update Replica status due to exception 35222." This seems to be a message related to clusters, but since we are clusterless I was confused.

There should be an error directly before this one; that's the actual error you want to look into. Error 35222 doesn't have anything to do with clustering. It's a replica error, not a clustering error.
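One way to find the entry logged just before the 35222 message is to load the current error log into a temp table and pull everything from the few seconds leading up to each hit. This is a rough sketch: the temp table name and the five-second window are arbitrary choices, and `sp_readerrorlog` needs elevated permissions.

```sql
-- Load the current SQL Server error log (log 0, type 1) into a temp table.
CREATE TABLE #errlog
(
    LogDate     datetime,
    ProcessInfo nvarchar(64),
    LogText     nvarchar(max)
);

INSERT INTO #errlog (LogDate, ProcessInfo, LogText)
EXEC sp_readerrorlog 0, 1;

-- Show every entry logged in the 5 seconds up to and including each 35222 hit.
SELECT e.LogDate, e.ProcessInfo, e.LogText
FROM #errlog AS e
WHERE EXISTS
(
    SELECT 1
    FROM #errlog AS hit
    WHERE hit.LogText LIKE N'%35222%'
      AND e.LogDate BETWEEN DATEADD(SECOND, -5, hit.LogDate) AND hit.LogDate
)
ORDER BY e.LogDate;

DROP TABLE #errlog;
```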

After we uninstalled CU1 on both replicas, I was able to create the AG and add 22 databases (including the two original). Failovers were without issue

Uninstalling CU1 takes you back to the behavior from before configuration-only replicas were introduced for Read-Scale AGs. Again, I'm surprised you didn't run into a few different issues, as Read-Scale isn't made for HADR.

On a side note, automatic seeding did not always work with multiple databases. The operation would fail with a "Seeding Check Message Timeout".

Seems unrelated, but you never know - could be a side effect of whatever was going on. Impossible to say at this point.
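If the seeding timeouts come back, the automatic seeding DMV records each attempt and its failure reason. A sketch of a query to run on the primary:

```sql
-- History of automatic seeding attempts, newest first.
SELECT
    ag.name              AS ag_name,
    adc.database_name,
    s.start_time,
    s.completion_time,
    s.current_state,
    s.performed_seeding,
    s.failure_state_desc, -- reason reported for a failed attempt
    s.error_code,
    s.number_of_attempts
FROM sys.dm_hadr_automatic_seeding AS s
JOIN sys.availability_groups AS ag
    ON ag.group_id = s.ag_id
JOIN sys.availability_databases_cluster AS adc
    ON adc.group_database_id = s.ag_db_id
ORDER BY s.start_time DESC;
```

Comparing rows for databases added together against those added one at a time would show whether the timeout is tied to concurrent seeding.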

I thought CUs were going to be tested at the same level as SPs.

You're correct. Changes in behavior between CUs can and do happen under the newer servicing model, much as they did between SPs (and even in the SP + CU days). I'd be interested to see whether the configuration-only replica solves your issue. It was added in CU1 specifically for metadata safety, because metadata issues for the replicas can and did happen since, again, Read-Scale wasn't made for HADR.