SQL Server 2017 with 500 databases – Frequent AG disconnects since CU9

availability-groups, sql-server, sql-server-2017

Hi everyone and thanks in advance for your help.
We are experiencing challenges with SQL Server 2017 Availability Groups.

Background

The company provides retail B2B back-end software. There are about 500 single-tenant databases, plus 5 shared databases used by all tenants. The workload is mostly reads, and the majority of databases have very low activity.

Physical production servers hosted at a co-location facility were recently upgraded from SQL Server 2014 Enterprise on Windows Server 2012 in a shared SAN / FCI configuration, to SQL Server 2017 Enterprise on Windows Server 2016, running on 2-socket / 32-core / 768 GB RAM machines with local SSD drives and using an AlwaysOn AG. AG traffic uses dedicated 10G NIC ports with a crossover-cable connection.

Their requirement is for all databases to fail over together, so they all had to be placed in a single AG, with a single non-readable synchronous replica on an identical server.
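For context, here is a rough sketch of the replica configuration described above (one AG holding all databases, synchronous commit, secondary not readable). The AG name, server names, endpoint URLs, and database names are placeholders, not our actual values:

    -- Hypothetical sketch of the layout described above; all names and
    -- endpoint URLs are placeholders, and only a few databases are shown.
    CREATE AVAILABILITY GROUP [AG_AllTenants]
    FOR DATABASE [Tenant001], [Tenant002], [SharedConfig]   -- ...all ~500 databases
    REPLICA ON
        N'NODE1' WITH (
            ENDPOINT_URL = N'TCP://node1.internal:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE = MANUAL,                     -- assumption; could be AUTOMATIC
            SECONDARY_ROLE (ALLOW_CONNECTIONS = NO)),
        N'NODE2' WITH (
            ENDPOINT_URL = N'TCP://node2.internal:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE = MANUAL,                     -- assumption; could be AUTOMATIC
            SECONDARY_ROLE (ALLOW_CONNECTIONS = NO));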

The new servers have been in production since June 2018. The latest CU (CU7 at the time) and Windows updates were installed, and the system was working well. About a month later, after updating the servers from CU7 to CU9, they started noticing the following challenges, listed in order of priority.

We have been monitoring the servers using SQL Sentry and have observed no physical bottlenecks. All key indicators look good: CPU averages 20%, I/O times are typically less than 1 ms, RAM is not fully utilized, and network utilization is below 1%.

Challenges

The symptoms seem to get better after failover, but come back within a few days, regardless of which server is primary – symptoms are identical on both servers.

  1. Sporadic client timeouts and connectivity failures such as

    …error occurred while establishing connection…

    or

    Execution timeout expired

    Sometimes these will go on for up to 40 seconds, and then subside.

  2. The transaction log backup job takes 10X longer to complete than before. Previously it took 2-3 minutes to back up the logs of all 500 databases; now it takes 15-25 minutes. We have verified that the backup itself runs fine with good throughput, but there is a small delay after completing the backup of one log and before starting the next. It starts off very low, but within a day or two grows to 2-3 seconds; multiplied by 500 databases, that accounts for the difference (a query to measure these gaps is sketched after this list).

  3. Occasionally, some seemingly random databases get stuck in the "Not synchronizing" state after a manual failover. The only way to resolve this is either to restart the SQL Server service on the secondary replica, or to remove these databases from the AG and rejoin them (commands sketched after this list).

  4. Another issue, introduced by CU10 (and not resolved in CU11): connections to the secondary time out due to blocking on master.sys.databases, and we cannot even use the SSMS Object Explorer against the secondary replica. The root cause appears to be blocking by the Microsoft SQL Server VSS writer, which issues the following query:

    select name, 
           recovery_model_desc, 
           state_desc, 
           CONVERT(integer, is_in_standby), 
           ISNULL(source_database_id,0) 
      from master.sys.databases
    
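To quantify the gap mentioned in point 2, something along these lines can be run against the msdb backup history. This is only a rough sketch: it assumes a single backup job cycling through the databases serially, and the 24-hour window is arbitrary:

    -- Sketch: measure the gap between the end of one log backup and the start
    -- of the next across all databases, using msdb backup history.
    WITH log_backups AS (
        SELECT database_name,
               backup_start_date,
               backup_finish_date,
               LAG(backup_finish_date) OVER (ORDER BY backup_start_date) AS prev_finish
        FROM msdb.dbo.backupset
        WHERE type = 'L'                                        -- log backups only
          AND backup_start_date >= DATEADD(DAY, -1, GETDATE())  -- last 24 hours
    )
    SELECT database_name,
           backup_start_date,
           DATEDIFF(SECOND, prev_finish, backup_start_date) AS gap_seconds
    FROM log_backups
    WHERE prev_finish IS NOT NULL
    ORDER BY backup_start_date;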

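For point 3, the remove-and-rejoin workaround on the secondary replica amounts to something like the following ([StuckDb] and [AG_AllTenants] are placeholder names; resuming data movement is worth trying first):

    -- Try resuming data movement for the stuck database first:
    ALTER DATABASE [StuckDb] SET HADR RESUME;

    -- If that does not help, remove the database from the AG on the secondary
    -- (it will go into RESTORING) and then rejoin it, provided the log chain
    -- is still intact:
    ALTER DATABASE [StuckDb] SET HADR OFF;
    ALTER DATABASE [StuckDb] SET HADR AVAILABILITY GROUP = [AG_AllTenants];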
Observations

I believe I found the smoking gun in the error logs. They are full of AG messages that are labeled 'informational only' but do not look normal at all, and their frequency correlates very strongly with the application errors.

The errors are of several types, and come in sequences:

  • DbMgrPartnerCommitPolicy::SetSyncState: GUID

  • DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: GUID

  • AlwaysOn Availability Groups connection with secondary database terminated for primary database 'XYZ' on the availability replica 'DB' with Replica ID: {GUID}. This is an informational message only. No user action is required.

  • AlwaysOn Availability Groups connection with secondary database established for primary database 'ABC' on the availability replica 'DB' with Replica ID: {GUID}. This is an informational message only. No user action is required.

On some days there are tens of thousands of these.

This article discusses the same sequence of errors on SQL Server 2016 and states that it is abnormal, which would also explain the 'not synchronizing' phenomenon after failover. The issue discussed there was specific to 2016 and was fixed earlier this year in a CU. However, it is the only relevant reference I could find for the first two message types, other than references to automatic initial seeding, which should not apply here since the AG is already established.

Here is a summary of the daily errors last week, for days that had > 10K errors per type on the PRIMARY (secondary shows 'losing connection with primary…'):

Date        Message Type (First 50 characters)                  Num Errors
10/8/2018   DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:  61953
10/3/2018   DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:  56812
10/4/2018   DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:  27951
10/2/2018   DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:  24158
10/7/2018   DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:  14904
10/8/2018   Always On Availability Groups connection with seco  13301
10/3/2018   DbMgrPartnerCommitPolicy::SetSyncState: 783CAF81-4  11057
10/3/2018   Always On Availability Groups connection with seco  10080
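For reference, a summary like the one above can be produced roughly as follows. This is only a sketch reading the current error log via xp_readerrorlog; the LIKE patterns and the 10K threshold are illustrative:

    -- Sketch: aggregate AG-related messages per day from the current error log.
    CREATE TABLE #errlog (
        LogDate     datetime,
        ProcessInfo nvarchar(100),
        LogText     nvarchar(max)
    );

    INSERT INTO #errlog (LogDate, ProcessInfo, LogText)
    EXEC xp_readerrorlog 0, 1;   -- 0 = current log file, 1 = SQL Server error log

    SELECT CAST(LogDate AS date) AS LogDay,
           LEFT(LogText, 50)     AS MessageType,
           COUNT(*)              AS NumErrors
    FROM #errlog
    WHERE LogText LIKE 'DbMgrPartnerCommitPolicy%'
       OR LogText LIKE '%Availability Groups connection with%'
    GROUP BY CAST(LogDate AS date), LEFT(LogText, 50)
    HAVING COUNT(*) > 10000
    ORDER BY NumErrors DESC;

    DROP TABLE #errlog;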

We also occasionally see "weird" messages such as:

The availability group database "DB" is changing roles from "SECONDARY" to "SECONDARY" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

… among a host of changing states from "SECONDARY" to "RESOLVING".

After a manual failover, the system may go for several days without a single message of these types, and then all of a sudden, for no apparent reason, we get thousands at once, which in turn causes the server to become unresponsive and causes application connection timeouts. This is critical, as some of the applications do not incorporate a retry mechanism and may therefore lose data.
When such a burst of errors occurs, the following wait types skyrocket. This shows the waits right after the AG appears to have lost connection to all databases at once:

[Screenshot: waits during a severe burst of AG errors]

About 30 seconds later, everything goes back to normal in terms of waits, but the AG messages keep flooding the error log at varying rates and at different, seemingly random times of day, including off-peak hours. A concurrent increase in workload during these error bursts makes things worse, of course. If only a few databases get disconnected, it typically does not cause connection timeouts, as it resolves quickly enough on its own.
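To confirm that a burst is HADR-related at the wait level, a crude before/after snapshot of sys.dm_os_wait_stats can be taken while the burst is happening; the 30-second interval and the HADR% filter are illustrative:

    -- Crude sketch: snapshot cumulative waits, wait 30 seconds, then show the
    -- delta for HADR-related wait types.
    SELECT wait_type, wait_time_ms, waiting_tasks_count
    INTO #waits_before
    FROM sys.dm_os_wait_stats
    WHERE wait_type LIKE 'HADR%';

    WAITFOR DELAY '00:00:30';

    SELECT w.wait_type,
           w.wait_time_ms        - b.wait_time_ms        AS wait_time_ms_delta,
           w.waiting_tasks_count - b.waiting_tasks_count AS waiting_tasks_delta
    FROM sys.dm_os_wait_stats AS w
    JOIN #waits_before AS b ON b.wait_type = w.wait_type
    WHERE w.wait_time_ms > b.wait_time_ms
    ORDER BY wait_time_ms_delta DESC;

    DROP TABLE #waits_before;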

We tried to verify that it was indeed CU9 that started the issue, but we were only able to downgrade both nodes as far as CU9. Attempts to downgrade either node to CU8 resulted in that node getting stuck in the 'Resolving' state, showing the same error in the log:

Cannot read the persisted configuration of Always On availability group with corresponding resource ID '…'. The persisted configuration is written by a higher-version SQL Server that hosts the primary availability replica. Upgrade the local SQL Server instance to allow the local availability replica to become a secondary replica.

This means we will have to introduce downtime in order to downgrade both nodes to CU8 at the same time. It also suggests there was some major change to the AG internals, which may explain what we are experiencing.

We have already tried adjusting max_worker_threads from its default of 0 (which equals 960 on our box, based on this article), gradually raising it to 2,000, with no observed impact on the errors.
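For completeness, that adjustment was made along these lines (2,000 being the last value tried):

    -- Raise 'max worker threads' from its default of 0 (auto-calculated,
    -- 960 on this box) to an explicit value.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max worker threads', 2000;
    RECONFIGURE;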

What can we do to solve these AG disconnects?
Is anyone out there experiencing similar issues?
Can other people with a large number of databases in an AG perhaps see similar messages in the SQL error log, starting with CU9 or CU8?

Thanks in advance for any help!

Best Answer

Update:

  1. The frequent Availability Group disconnects were confirmed to be a regression introduced by CU9, and they were resolved after installing CU12.
  2. The blocking issues on the secondary replica were confirmed to be caused by a change to the VSS writer code introduced in CU10. Hopefully it will be resolved in CU13. The interim solution is to manually replace the VSS writer DLLs with the pre-CU10 DLLs...

    BEGIN RANT-SACTION;
    

    Unfortunately, Microsoft seems to be repeatedly failing to properly QA not only Windows 10 updates, but enterprise mission-critical software such as SQL Server as well.

    I much preferred their previous service-pack strategy: at least those had enough time to be tested properly, instead of inflicting production crises and data loss on customers through careless releases of half-baked updates.

    COMMIT RANT-SACTION;