Sql-server – SQL Cluster failover doesn’t work after applying SQL Server 2016 Service Pack 2

sql-server-2016

I tried to apply SQL Server 2016 Service Pack 2 on a clustered instance of SQL Server (SQL Server 2016 SP1 CU7, 2 nodes, active-passive architecture). We had done this patching previously on another cluster with a similar configuration and it went fine. I started with the passive node; the installation succeeded, but when I tried to move the SQL Server role to the newly patched node (in order to start patching the second node), the failover failed. In the cluster events log I found the corresponding error:

The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

As per the error message, I checked the System event log on the passive node and found a few consecutive errors at that time:

6/9/2018 7:08:07 PM

Cluster resource 'SQL Server' of type 'SQL Server' in clustered role 'SQL Server (MSSQLSERVER)' failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

6/9/2018 7:08:07 PM

The Cluster service failed to bring clustered role 'SQL Server (MSSQLSERVER)' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

6/9/2018 7:08:07 PM

Clustered role 'SQL Server (MSSQLSERVER)' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

and finally the critical one at 6/9/2018 7:08:37 PM

The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
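The threshold message above can be cross-checked from PowerShell. As a sketch (the role name `SQL Server (MSSQLSERVER)` is taken from the events above; whether the default failover policy applies to this cluster is an assumption):

```powershell
# Check the current state of the clustered role and its resources
Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)"
Get-ClusterResource | Where-Object OwnerGroup -eq "SQL Server (MSSQLSERVER)"

# Inspect the failover policy the third event refers to:
# FailoverThreshold = max failover attempts, FailoverPeriod = window in hours
Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)" |
    Select-Object Name, FailoverThreshold, FailoverPeriod

# Once the underlying issue is resolved, the role can be brought online manually
Start-ClusterGroup -Name "SQL Server (MSSQLSERVER)"
```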

Unfortunately, these messages don't tell me much, so I tried to dig into the cluster log. There is definitely more detail there, but I still struggle to find the root cause and a solution.
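For reference, the cluster log can be regenerated on any node with the standard `Get-ClusterLog` cmdlet; a minimal sketch (the time span and destination path here are my choices, not values from the original log):

```powershell
# Dump the cluster log for the last 30 minutes from every node into C:\Temp,
# using local time so entries line up with the System event log timestamps
Get-ClusterLog -TimeSpan 30 -Destination C:\Temp -UseLocalTime
```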

You can find the full cluster log here: https://1drv.ms/u/s!Aiz8LIBP787C6ABO8aqfN9ayyFC0

I think the key to solving the issue is in the cluster log, but the error messages are not clear to me. Perhaps some of you have run into this issue before?

It's worth mentioning that we run our SQL Server clusters on Azure and use Storage Spaces Direct (S2D) for cluster storage. I mention this because we can see the following errors in the cluster log:

000009c8.00002e04::2018/06/09-17:07:35.889 ERR   [API] ApipGetLocalCallerInfo: Error 3221356570 calling RpcBindingInqLocalClientPID.
000009c8.00002e04::2018/06/09-17:07:35.889 INFO  [CftCache] Querying whether disk {166faa04-8d1a-4901-e0b4-a7c75bdd54e3} supports cache state [ReadWrite]
000009c8.00002e04::2018/06/09-17:07:35.889 INFO  [CftCache] Can't query seek penalty storage property for device [Disk: Id {166faa04-8d1a-4901-e0b4-a7c75bdd54e3} # 2 #paths 0 #id 0 H      ], error 31.
000009c8.00002e04::2018/06/09-17:07:35.891 INFO  [CftCache] Disk [Disk: Id {166faa04-8d1a-4901-e0b4-a7c75bdd54e3} # 2 #paths 0 #id 0 H      ] Path \\?\Disk{166faa04-8d1a-4901-e0b4-a7c75bdd54e3} candidate? true cacheSupported? true supportStatus 0
000009c8.00002e04::2018/06/09-17:07:35.892 ERR   [API] ApipGetLocalCallerInfo: Error 3221356570 calling RpcBindingInqLocalClientPID.

Perhaps it's a storage issue? But if that's the case, why was it revealed only by the SQL Server patching?

Best Answer

  1. Actually, it looks like a storage issue. I would recommend checking the S2D storage. This and this might help you.

  2. Assuming you already run your SQL Server cluster on top of S2D, how many nodes does it have?

If it is a 2-node S2D cluster, I suggest you consider moving either to AlwaysOn Availability Groups, which have no need for shared storage, or to Failover Cluster Instances on top of StarWind VSAN Free. AlwaysOn Availability Groups provide active-passive application-level replication. StarWind does active-active block-level replication, enabling SQL Server instance failover between nodes.
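The S2D check in point 1 can be sketched with the standard Storage Spaces Direct cmdlets, run on any cluster node (which of these will surface the actual fault here is an assumption):

```powershell
# Overall health of the S2D storage subsystem
Get-StorageSubSystem Cluster* | Get-StorageHealthReport

# Any physical disks that are unhealthy or have lost communication
Get-PhysicalDisk | Where-Object HealthStatus -ne "Healthy"

# Virtual disks and storage jobs (a repair in progress can block failover)
Get-VirtualDisk
Get-StorageJob
```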