SQL Server – Unexpected Always On availability group failover happening – databases changing to RESOLVING

availability-groups sql-server sql-server-2012

I recently applied a SQL Server patch to a two-node SQL Server 2012 availability group. Everything went well as far as I can tell. Two days after the patch, an unexpected failover happened, and then it happened again today.

For a brief period, the primary of the availability group goes into the RESOLVING state and is forced through a failover before data movement resumes as normal.

Now, I am not sure whether this is related to the patch, because the error logs point to other issues. Here are the logs from the two times this happened.

Date      6/11/2018 4:40:06 AM
Log       SQL Server (Current – 6/11/2018 2:25:00 PM)
Source    Logon
Message   Error: 983, Severity: 14, State: 1.

Date      6/11/2018 4:40:06 AM
Log       SQL Server (Current – 6/11/2018 2:25:00 PM)
Source    Logon
Message   Unable to access database 'accounting' because its replica role is RESOLVING which does not allow connections. Try the operation again later.

And today:

Date      6/14/2018 3:13:47 PM
Log       SQL Server (Current – 6/14/2018 3:14:00 PM)
Source    spid6s
Message   SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\EveriT31_Park.mdf] in database [EveriT31_Parx] (21). The OS file handle is 0x0000000000000B00. The offset of the latest long I/O is: 0x00000002750000

Date      6/14/2018 3:14:04 PM
Log       SQL Server (Current – 6/14/2018 3:14:00 PM)
Source    spid127
Message   Error: 18056, Severity: 20, State: 46.

Date      6/14/2018 3:14:04 PM
Log       SQL Server (Current – 6/14/2018 3:14:00 PM)
Source    spid127
Message   The client was unable to reuse a session with SPID 127, which had been reset for connection pooling. The failure ID is 46. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

The AG is set for automatic failover, but the Always On group is not really changing roles. After these episodes, the primary remains the primary and the secondary remains the secondary.

In every log I look at, the first error is either the I/O issue or the connection-pool error (Error: 18056, Severity: 20, State: 46), but I hesitate to say that is really the problem. I don't think it is related to the SQL patch I just applied; I don't see anything in the logs that points to it, but I want to rule it out.

Can anyone suggest what might be causing these? I looked at some articles, but none gave me a clear answer.

Best Answer

It is set for automatic failover but the Always On group is not really changing roles. After these episodes, the primary remains primary and the secondary remains secondary.

I've seen this pattern (PRIMARY -> RESOLVING for a brief period -> PRIMARY, and the same for the SECONDARY) before. In my case, we were using a file share witness as the third quorum vote. When this "not failover" happened, it was because the server hosting the file share witness had been taken down for maintenance.

Windows Server Failover Clustering (WSFC) uses a process called the Resource Hosting Subsystem (RHS) to monitor the health of resources in the cluster. This includes both AG nodes, and in my case the file share witness resource.

While the AG (and cluster) can stay up and running with just the two AG nodes, after the file share witness has been down for 5 minutes, a process called RHS Recovery is started. You can read more about that here: Understanding how Failover Clustering Recovers from Unresponsive Resources
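
If you are not sure whether your cluster actually relies on a file share witness (or whether that witness resource was reachable at the time of the blip), a quick check from PowerShell on one of the AG nodes looks roughly like this. This is only a sketch; the resource-type filter assumes the default "File Share Witness" resource type name.

    # Requires the FailoverClusters module (installed with the Failover Clustering feature)
    Import-Module FailoverClusters

    # Show the quorum configuration - look for a file share witness quorum resource
    Get-ClusterQuorum

    # Check the current state of the witness resource itself
    Get-ClusterResource |
        Where-Object { $_.ResourceType -like "*File Share Witness*" } |
        Format-List Name, State, OwnerGroup, OwnerNode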

If all your resources are hosted in the same RHS process (which is the default behavior), this will cause that "blip" in AG availability while the RHS.exe process is restarted.

Normally, after one failure, the problematic resource should automatically be marked as "SeparateMonitor" and started up in its own RHS.exe process (so that it doesn't continue to cause problems for your other cluster resources). So I'm surprised this happened more than once, but maybe someone changed the setting back, or the behavior is different for different resources or Windows versions.

You can check if that setting is on, or set it manually, in the Cluster Manager. Here's a screenshot:

screenshot of file share witness in cluster manager
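
If you prefer PowerShell to the Failover Cluster Manager GUI, the same setting can be inspected (and, if you want, changed) roughly like this. Treat it as a sketch: "File Share Witness" is just a placeholder for whatever your witness resource is actually named, and the change only takes effect the next time the resource is brought online.

    Import-Module FailoverClusters

    # List every cluster resource and whether it runs in its own RHS process
    Get-ClusterResource | Format-Table Name, ResourceType, State, SeparateMonitor -AutoSize

    # Move one resource into its own RHS.exe process
    $witness = Get-ClusterResource -Name "File Share Witness"
    $witness.SeparateMonitor = 1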

Anyway, this problem, and especially your comment, sounded very familiar to me so I thought I'd share the experience in case it helps you or others.

I blogged in detail about the troubleshooting process I used for this situation here: Troubleshooting an AG Failure

You should be able to confirm that RHS Recovery is the root cause by pulling the cluster log (run the Get-ClusterLog PowerShell cmdlet) and checking for messages around the time of your AG blip that mention "DeadlockMonitor", which is the component that checks "down" resources and initiates the recovery process.
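
As a rough sketch of that last step (the destination folder and the 60-minute window are just example values), you could pull the log from every node and scan it for the relevant messages like this:

    Import-Module FailoverClusters

    # Generate cluster.log on every node for the last 60 minutes and copy the files locally
    Get-ClusterLog -Destination C:\Temp -TimeSpan 60

    # Look for deadlock-monitor / RHS recovery activity around the time of the AG blip
    Get-ChildItem C:\Temp\*_cluster.log |
        Select-String -Pattern "DeadlockMonitor", "Recover" |
        Select-Object -First 50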