SQL Server – cluster not available after failure of the "main" server

clustering, sql-server

I asked this question on Stack Overflow, but was asked to repost it here, so here it is:

I have a database that I'm setting up in a clustered environment. It's two servers with shared storage. ServerA is the server where I set up the database, and ServerB is a node in the SQL Server cluster.

If I forcibly turn off ServerB, everything keeps working as expected, but if I turn off ServerA, the instance is no longer available.

The Windows cluster is still alive (I can Remote Desktop to the cluster via its shared name), but SQL Server is not visible, and Cluster Manager reports the resource as offline.

This kind of defeats the purpose of the cluster, and I need to solve it, but it's not really my area of expertise and I have no idea where to start.

Thanks in advance.

EDIT (additional info asked for in comments): After Windows was installed, I added the clustering feature for DTC and SQL Server. As for access, the IP addresses of the machines are *.101 for the main server, *.102 for the second server, *.99 for the SQL Server instance access point, and *.111 for the cluster manager. I can access .99 from both machines, but can't access the database at .101 or .102 from either.

EDIT #2 – I tried increasing the failover threshold (to 50 failures in 24 hours, for our testing purposes), but the server still doesn't come back up after I cut off the network for ServerA. The cluster comes back in about 20 seconds or so, but SQL Server stays down. In the logs I have several errors and one Critical-level message that is probably causing the failure. The message is:

The Cluster service is shutting down because quorum was lost. This
could be due to the loss of network connectivity between some or all
nodes in the cluster, or a failover of the witness disk. Run the
Validate a Configuration wizard to check your network configuration.
If the condition persists, check for hardware or software errors
related to the network adapter. Also check for failures in any other
network components to which the node is connected such as hubs,
switches, or bridges.
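(As an aside, that message points at the quorum configuration: in a two-node cluster with no witness, losing one node means losing quorum. A sketch of how the quorum model can be inspected and changed with PowerShell from the FailoverClusters module — the file-share path below is only a placeholder, not something from my setup:)

```powershell
# Show the current quorum model and the witness resource, if any
Get-ClusterQuorum

# Example: switch to node-and-file-share majority so the cluster can
# survive the loss of one node (placeholder share path)
Set-ClusterQuorum -NodeAndFileShareMajority \\fileserver\witness
```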

Best Answer

Without looking at the cluster logs or any other form of error reporting, all I can do is guess here.

But my initial thought is that you may have hit the failover threshold. By default, this is set to a maximum of n - 1 failures (where n is the number of nodes) over a period of 6 hours. Yes, that's a long time, and in a 2-node cluster that isn't very many failures (it equates to only one failure). This threshold exists to prevent the ping-pong effect of cluster groups failing back and forth between nodes.

In production, this is probably a good thing. But in testing/development/non-production it is pretty common to run into this initially perplexing problem, as you may be trying to fail over consecutively. It is worth noting that these parameters are fully configurable. All you need to do is go into the properties of the cluster group, and on the "Failover" tab you will have the option to change the two parameters (Maximum failures in the specified period, and Period (hours)). Here is what this looks like in Failover Cluster Manager:

[Screenshot: cluster group Properties dialog, Failover tab]

Note: in my screenshot, the threshold is set to 2 because I have a 3-node cluster.

Likewise, these values can be viewed with PowerShell (using the FailoverClusters module):

# set the cluster group name to whatever it is named
# in your environment
Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)" |
    Select-Object Name, FailoverThreshold, FailoverPeriod
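Setting the values works the same way, since the properties on the group object returned by Get-ClusterGroup are writable. A sketch, assuming the same group name as above and the 50-failures-in-24-hours values from your EDIT #2:

```powershell
# Raise the threshold for testing: 50 failures allowed per 24-hour period
$group = Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)"
$group.FailoverThreshold = 50   # maximum failures in the specified period
$group.FailoverPeriod    = 24   # period, in hours
```

Remember to set these back to something conservative before the cluster goes into production.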