Sql-server – Availability mode – manual failover mode, best practice for availability mode

availability-groups, high-availability, sql-server

We don't want automatic failover for our availability group, but I have set the availability mode to synchronous commit. Was that a mistake?

The availability group runs on VMware with Windows Server 2012 R2, and I got the error messages below after the cluster "crashed". (Some manual VMware migration was going on at the same time.)

Can I relate the crash to the availability mode setting?

Thank You in advance.
Regards Odd

13:58:56 – Critical: Cluster node '' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

13:58:56 – Critical: Cluster node '' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

13:59:58 – Critical: File share witness resource 'File Share Witness' failed to arbitrate for the file share '\\Quorum'. Please ensure that file share '\\Quorum' exists and is accessible by the cluster.
13:59:58 – Error: Cluster resource 'File Share Witness' of type 'File Share Witness' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

13:59:58 – Critical: The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

14:01:22 – Error: Cluster resource '_Grp' of type 'SQL Server Availability Group' in clustered role '_Grp' failed.

14:01:34 – Error: Cluster resource '_Grp' of type 'SQL Server Availability Group' in clustered role '_Grp' failed.

14:01:34 – Error: The Cluster service failed to bring clustered role '_Grp' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Best Answer

Can I relate the crash to the availability mode setting?

No. Judging by your log messages, the cluster lost quorum because two nodes' votes were removed, followed by the file share witness. Is this a three-node cluster with a file share witness, perchance? If so, and if you pulled these events from a single node's event log, each node may have seen a loss of communication with every other voter. That would produce an error footprint similar, if not identical, to the one above: nobody can talk to anybody.

While that lasts, quorum is lost, which is what you are seeing. I am making some assumptions here, since I'd need much more diagnostic information to pinpoint why the voters were removed, but that is why quorum was lost.

Regardless, this appears to be a problem that resulted in a down cluster, in which case your availability mode would have nothing to do with the WSFC failing.
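To gather the kind of diagnostic information mentioned above, a minimal sketch using the FailoverClusters module might look like the following. It only reads cluster state; the destination folder is a hypothetical path, and the DynamicWeight column assumes Windows Server 2012 R2, as in the question.

```powershell
# Load the failover clustering cmdlets.
Import-Module FailoverClusters

# Show the current quorum configuration, including the witness resource if one is set.
Get-ClusterQuorum | Format-List *

# List each node with its state and vote.
# NodeWeight is the configured vote; DynamicWeight is the vote currently
# counted under dynamic quorum.
Get-ClusterNode |
    Select-Object Name, State, NodeWeight, DynamicWeight

# Generate cluster.log files covering the last 60 minutes on all nodes and
# copy them to one folder. C:\Temp\ClusterLogs is a hypothetical path.
Get-ClusterLog -Destination "C:\Temp\ClusterLogs" -TimeSpan 60
```

Comparing the node states and votes with the timestamps in the cluster logs should show whether the nodes really lost contact with each other and with the witness at the same moment.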

As for "best practices" for the availability mode, you need to determine your requirements for data loss, performance impact, and a few other factors that are best described in the Books Online (BOL) reference on Availability Modes.
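On the original question: availability mode and failover mode are separate replica settings, and synchronous commit with manual failover is a supported combination. A hedged sketch for checking what each replica is currently set to, assuming the SqlServer PowerShell module (or the older SQLPS module) is installed and using hypothetical server, instance, and availability group names:

```powershell
# The SQLSERVER: drive is provided by the SqlServer module (SQLPS on older installs).
Import-Module SqlServer

# Hypothetical placeholders: PrimaryServer, InstanceName, and MyAg.
# For a default instance, use DEFAULT as the instance name.
$replicaPath = "SQLSERVER:\SQL\PrimaryServer\InstanceName\AvailabilityGroups\MyAg\AvailabilityReplicas"

# Each item is a replica object; show its commit mode and failover mode.
Get-ChildItem -Path $replicaPath |
    Select-Object Name, AvailabilityMode, FailoverMode
```

A replica showing SynchronousCommit together with Manual will not fail over on its own.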

Related Solutions

Sql-server – SQL Server cluster not available after failure of the “main” server

Without looking at the cluster logs or any other form of error reporting, all I can do is guess here.

My initial thought is that you may have hit the failover threshold. By default, it is set to a maximum of n - 1 failures (where n is the number of nodes) within a period of 6 hours. Yes, that's a long time, and in a 2-node cluster it allows only one failure. The threshold exists to prevent cluster groups from ping-ponging between nodes.

In production, this is probably a good thing. In test, development, and other non-production environments, though, it is common to run into this initially perplexing problem, because you may be failing over repeatedly. Both parameters are fully configurable: open the properties of the cluster group and, on the "Failover" tab, change Maximum failures in the specified period and Period (hours). Here is what this looks like in the Failover Cluster Manager:

[Screenshot: the Failover tab of the cluster group properties in Failover Cluster Manager]

Note: in my screenshot, the threshold is set to 2 because I have a 3-node cluster.

Likewise, this can be seen with PowerShell (accessing the FailoverClusters module).

# you may need to set your cluster group name to whatever it is named
# in your environment
#
Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)" | 
    Select-Object Name, FailoverThreshold, FailoverPeriod

Sql-server – Always On AG is Down when service is stopped

It could be the issue described in INF: AlwaysOn – The secondary database doesn’t come automatically when the primary instance of SQL Server goes down by Arvindh Kalidasan - Support Engineer, Microsoft GTSC.

In this blog we would discuss about behavior of AlwaysOn availability group where the secondary database doesn't come automatically when the primary instance goes down. The secondary database goes into Resolving state. On the failover cluster manager the resource appears in fail state.

[...] we found that if we stop SQL Service manually on primary replica, it fails over to second node only once. Any further attempts of stopping SQL Service (to test auto failover) would not cause failover.

The workaround posted there is:

[...] to have the value set to a higher number for "Maximum Failures in the specified period".

Maximum Failures in the specified period: set to 60

Period (Hours): set to 1
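The same values can be set from PowerShell instead of the GUI. This is a minimal sketch, assuming the FailoverClusters module is available; "MyAgGroup" is a hypothetical placeholder for the clustered role name of your availability group, and 60 and 1 are simply the values quoted above, which are better suited to failover testing than to production.

```powershell
Import-Module FailoverClusters

# "MyAgGroup" is a hypothetical name; replace it with your clustered role name.
$group = Get-ClusterGroup -Name "MyAgGroup"

# Allow up to 60 failures within a 1-hour period (the values from the quoted workaround).
$group.FailoverThreshold = 60
$group.FailoverPeriod = 1

# Read the values back to confirm the change.
$group | Select-Object Name, FailoverThreshold, FailoverPeriod
```

Once testing is done, consider setting the threshold back to a lower value so that a genuinely failing role does not bounce between nodes.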
