Sql-server – Does SQL Server 2017 CU1 break clusterless availability groups

availability-groups sql-server-2017

Background:

My division is doing an upgrade from SQL Server 2008R2 with mirroring to SQL Server 2017 with clusterless availability groups. Until recently the testing had surfaced no issues or red flags. Then we installed CU1, had issues, uninstalled CU1, and the issues went away. The OS is Windows Server 2016 with the latest patches.

Observed behavior after CU1:

Using either SSMS or T-SQL, we could create a two-replica clusterless synchronous availability group and add one database to it. The group could be failed over multiple times without issue. Ah, but add a second database and issues would arise on the failovers. One of the databases would invariably wind up in a not synchronizing state. No amount of fiddling could resurrect it. If I dropped and recreated the whole thing, it might be the other database that went to not syncing. A pertinent error message in the logger was "Failed to update Replica status due to exception 35222." This seems to be a message related to clusters, but since we are clusterless I was confused.

After we uninstalled CU1 on both replicas, I was able to create the AG and add 22 databases (including the two original). Failovers were without issue.

On a side note, automatic seeding did not always work with multiple databases. The operation would fail with a "Seeding Check Message Timeout". Dropping those databases from the AG and adding them one at a time was successful.

My question is:

Has anyone else experienced issues with clusterless AGs after CU1? If so, were you successful where I was not?

Comment/opinion:

I thought CUs were going to be tested at the same level as SPs. While I know that bugs creep in no matter how thorough the testing, having this happen on the first one is troubling. It will cause us to really stress test each CU before deploying, which means we will not deploy them as they come out. We will deploy them only when we think there is a need to. We are a small organization without a dedicated DBA, and need to be selective about what we take on.

Best Answer

My division is doing an upgrade from SQL Server 2008R2 with mirroring to SQL Server 2017 with clusterless availability groups.

So you're upgrading versions but REMOVING high availability and disaster recovery? Clusterless AGs are called "Read-Scale" AGs and do not give high availability and you can argue on the disaster recovery part...

The group could be failed over multiple times without issue. Ah, but add a second database and issues would arise on the failovers. One of the databases would invariably wind up in a not synchronizing state. No amount of fiddling could resurrect it.

I've been seeing this when a configuration-only replica (introduced in CU1) isn't used with Read-Scale AGs that are being used to fail over. Read-Scale wasn't made to fail over and all that jazz; it was made to horizontally scale out read copies for intense read situations (or as a way to replicate across Windows/Linux for migrations). I must reiterate: "clusterless" AGs are not made for HADR. If this is part of your use case, use WSFC or Pacemaker (Linux). Info on Configuration Only Replica.

A pertinent error message in the logger was "Failed to update Replica status due to exception 35222." This seems to be a message related to clusters, but since we are clusterless I was confused.

There should be an error directly before this one; that is the actual error you want to look into. Exception 35222 has nothing to do with clustering. It is a replica error.

After we uninstalled CU1 on both replicas, I was able to create the AG and add 22 databases (including the two original). Failovers were without issue

This goes back to the behavior before configuration-only replicas could be added to Read-Scale AGs. Again, I'm surprised you didn't run into a few different issues, as Read-Scale isn't made for HADR.

On a side note, automatic seeding did not always work with multiple databases. The operation would fail with a "Seeding Check Message Timeout".

Seems unrelated, but you never know - could be a side effect of whatever was going on. Impossible to say at this point.

I thought CUs were going to be tested at the same level as SPs.

You're correct. Changes in behavior between CUs can and do happen under the newer servicing model (this also happened in the SP + CU days). I'd be interested to see whether the configuration-only replica solves your issue, since it was added in CU1 specifically for metadata safety: metadata issues for the replicas could and did happen because, again, Read-Scale wasn't made for HADR.

Related Solutions

SQL Server 2016 – Distributed Availability Group Direct Seeding FAILED, SQL Error

The AlwaysOn Professional blog has some general troubleshooting steps for direct seeding and also includes some details about trace flag 9567 to enable compression during seeding, but I didn't find any details about the SQL Error or Seeding Timeout.

We have previously had issues with large databases causing problems in availability groups, but this is usually resolved by applying the latest transaction logs from the primary against the replica.

In this case the databases were listed on the secondary availability group as recovering, so I tried applying the latest transaction log backups from the primary and then joining the database to the secondary availability group:

--Restore transaction logs from primary and stay in recovery mode. Multiple backup files may need to be restored from oldest to newest.
RESTORE LOG stackoverflow FROM DISK = '\\Backups\SQL\_Trans\StackOverflow_AG\StackOverflow\StackOverflow_LOG_20170810_175400.trn' WITH NORECOVERY;

ALTER DATABASE stackoverflow SET HADR AVAILABILITY GROUP = [StackOverflow_RAG];
ALTER DATABASE stackoverflow SET HADR RESUME;

This worked for both of the failed databases and fixed the replication issues. Our reporting cluster now has all databases kept in sync from the primary availability group.
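
The comment in the restore script notes that multiple backup files may need to be restored from oldest to newest. That ordering can be scripted. The following is a minimal shell sketch, not the method used in the answer: it assumes the log backups sit in one directory the shell can read, that every file name carries a sortable YYYYMMDD_HHMMSS timestamp like the one above, and that paths contain no single quotes or newlines. The function name gen_restore_log and its arguments are hypothetical.

```shell
# Hypothetical helper: print one RESTORE LOG statement per .trn file in a
# directory, oldest first. Assumes file names share a prefix and embed a
# YYYYMMDD_HHMMSS timestamp, so a plain byte-wise sort is chronological.
gen_restore_log() {
    local db="$1" dir="$2" f
    for f in "$dir"/*.trn; do
        [ -e "$f" ] || continue   # no matches: the glob stays literal, skip it
        printf '%s\n' "$f"
    done | LC_ALL=C sort | while IFS= read -r f; do
        # NORECOVERY keeps the database restoring so the next log can apply.
        printf "RESTORE LOG [%s] FROM DISK = N'%s' WITH NORECOVERY;\n" "$db" "$f"
    done
}
```

Something like `gen_restore_log stackoverflow /mnt/backups/StackOverflow > restore_logs.sql` (the path here is made up) would produce a script to review before running. The emitted path is whatever the shell sees, so it may need translating to the path SQL Server itself uses to reach the backups.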

Sql-server – SQL Server 2017 Linux CU1 – MODIFY FILE encountered operating system error 31

I figured out the issue.

Problem: the EXT3 file system is not supported.

https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-setup lists the file system requirement as: XFS or EXT4 (other file systems, such as BTRFS, are unsupported).

Solution: Create an EXT4 file system.

Comment out the line in /etc/fstab that mounts the ext3 filesystem

vi /etc/fstab

-mount logical volume

"#/dev/vgsqldata/lvsqldata /sqldata ext3 defaults 1 1"

:wq!

Reboot the server

reboot

Check the file system is not mounted

df -kh

Check the linear volume

vgscan

vgdisplay vgsqldata

lvdisplay -v /dev/vgsqldata/lvsqldata

Format the volume as ext4

mkfs.ext4 /dev/vgsqldata/lvsqldata

Mount

mkdir /sqldata

mount -t ext4 /dev/vgsqldata/lvsqldata /sqldata

df -kh

touch /sqldata/test.txt

ls -la /sqldata

rm -rf /sqldata/test.txt

Persist the mount

vi /etc/fstab

-mount logical volume

/dev/vgsqldata/lvsqldata /sqldata ext4 defaults 1 1

Change Owner

chown -R mssql:mssql /sqldata

ls -la /

"drwxr-xr-x. 5 mssql mssql 4096 Nov 7 10:43 sqldata"
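
To confirm the result, the file system type behind the mount point can be checked against the supported list quoted earlier (XFS or EXT4). This is a minimal sketch, assuming findmnt from util-linux is installed; the function names is_supported_fs and fs_type_of are hypothetical.

```shell
# Hypothetical helper: succeed only for the file system types the setup
# guide quoted above lists as supported (XFS or EXT4).
is_supported_fs() {
    case "$1" in
        xfs|ext4) return 0 ;;
        *)        return 1 ;;
    esac
}

# Print the file system type backing a path (requires findmnt from util-linux).
fs_type_of() {
    findmnt -n -o FSTYPE --target "$1"
}

# Usage sketch; /sqldata is the mount point from the steps above:
#   if is_supported_fs "$(fs_type_of /sqldata)"; then
#       echo "supported"
#   else
#       echo "unsupported"
#   fi
```

Keeping the type check separate from the lookup means the same test can be applied to a type read from /etc/fstab before mounting.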
