Sql-server – Availability Group failover stuck in Resolving State

availability-groups, sql server, sql-server-2012, windows-server

I am setting up a lab for testing automatic failover for SQL Server.

I have a 2-node WSFC (SQL-CLUSTER-TEST-01) with the nodes:

SQL-TEST-01
SQL-TEST-02

There is also a listener, SQL-HA-GRP-01.

My initial testing is to just power off either of the nodes (virtual machines). It is failing over correctly in one direction, from SQL-TEST-01 to SQL-TEST-02. But when SQL-TEST-02 is the primary and I pull the power on it, the availability group goes into a permanent "resolving" state, which I can only resolve by powering SQL-TEST-02 back on.

When I check the logs or try to view the properties of the availability group it says the quorum was lost.

What am I missing?

Best Answer

Were you able to get this sorted out?

If not, try the following:

Go to Failover Cluster Manager (cluadmin.msc from Run).
Expand the cluster name (click the + sign).
Expand Services and Applications. You should now see your Availability Group listed there.
Right-click it and go to Properties.
In the General tab you should see Preferred Owners.

Since you've mentioned this is a lab environment, play around with the options (select both nodes as preferred owners, or alternate which one is selected) and see where that gets you. The tab next to General (Failover) controls whether the group fails back. Post back with results.
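If you'd rather check the same settings from PowerShell, something along these lines should work. It's only a sketch: "YourAgName" is a placeholder for whatever the availability group's cluster group is called, and the node names are the ones from the question.

```powershell
Import-Module FailoverClusters

# Placeholder: the cluster group that represents the availability group.
$group = "YourAgName"

# Show the current preferred owners of the group.
Get-ClusterOwnerNode -Group $group

# Set both lab nodes as preferred owners (node names taken from the question).
Set-ClusterOwnerNode -Group $group -Owners "SQL-TEST-01", "SQL-TEST-02"

# Show the failback settings (AutoFailbackType: 0 = prevent failback, 1 = allow failback).
Get-ClusterGroup -Name $group |
    Select-Object Name, OwnerNode, AutoFailbackType, FailbackWindowStart, FailbackWindowEnd
```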

Related Solutions

SQL Server 2012 – Fix AlwaysOn Availability Group Automatic Failover

If I disconnect DEV-AWEB5

Define "disconnect", if you will. My guess is you kept the box up but took SQL Server down.

I cannot connect to the Group Listener (DevListener), but I can ping it and it will respond to my ping

That's because the listener is just a virtual network name (VNN) within the WSFC cluster resource group for the represented availability group. Your DEV-AWEB5 node still owns the cluster resource group; most likely it's just the AG cluster resource that is in a failed state. The VNN is still online (expected behavior). It's simply pointing to whatever node owns that resource group (in this case, DEV-AWEB5). In fact, if you had PowerShell remoting enabled, you could run the following:

Invoke-Command -ComputerName "YourListenerName" -ScriptBlock { $env:computername }

The return of that would be the computer name of your owning node.

Likewise, if you can RDP into DEV-AWEB5 (provided you have the capability and accessibility, etc.), then you'd be able to RDP using the listener name (mstsc /v:YourListenerName). It's just a VNN.

By all of your symptoms, I'd be willing to bet that you've reached your failover threshold. The failover threshold determines how many times the cluster will attempt to fail over your resource group in a specified time period. The defaults are a maximum of n - 1 failovers (where n is the number of nodes) in a period of 6 hours. You can see these values with the following WSFC PowerShell command:

Get-ClusterGroup -Name "YourAgName" |
    Select-Object Name, FailoverThreshold, FailoverPeriod

That just gives you the settings (which you can modify if you so choose, of course).
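For a lab where you're deliberately failing over again and again, raising the threshold might look like the sketch below. The values 10 and 1 are purely illustrative, not recommendations, and "YourAgName" is the same placeholder as above.

```powershell
Import-Module FailoverClusters

# Placeholder group name; use the name of your AG's cluster group.
$group = Get-ClusterGroup -Name "YourAgName"

# Illustrative lab values: allow up to 10 failovers within a 1-hour period.
$group.FailoverThreshold = 10
$group.FailoverPeriod = 1   # FailoverPeriod is expressed in hours

# Confirm the new settings.
Get-ClusterGroup -Name "YourAgName" |
    Select-Object Name, FailoverThreshold, FailoverPeriod
```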

To prove that this is the case for you, you would need to generate the cluster log (the system event logs only go into detail as far as " has failed", or something like that).

Get-ClusterLog -Node "YourClusterNode" -TimeSpan <amount_of_minutes_since_failure>

That'll by default get put into the "C:\Windows\Cluster\Reports" folder, and the file is called "Cluster.log".

If you were to open up that cluster log, you should be able to find the following string in there, indicating exactly what happened and why it happened:

Not failing over group [YourClusterGroupName], failoverCount [# of failovers], failover threshold [failover threshold value], nodeAvailCount [node available count].

The above message is simply WSFC telling you that it will not fail over your group because it has already failed over too many times (you hit the threshold).
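Rather than scrolling through the log by hand, you could search for that message. A rough sketch, assuming the log was written to the default location mentioned above:

```powershell
# Default output location of Get-ClusterLog; adjust if you used -Destination.
$logPath = "C:\Windows\Cluster\Reports\Cluster.log"

# Find every line where WSFC declined to fail the group over.
Select-String -Path $logPath -Pattern "Not failing over group" -SimpleMatch |
    Select-Object LineNumber, Line
```

If this returns nothing, the threshold probably isn't what stopped the failover, and the rest of the log is the next place to look.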

Why does this happen? Simply to prevent the Ping-Pong effect of cluster resources going back and forth too frequently between nodes.

While it is common to hit these thresholds in failover testing, in production it would typically point to a problem that should be investigated.

Sql-server – AlwaysOn Availability Groups go to Resolving status

I've seen this happen a lot when folks only have one set of network cables connecting their servers - meaning, a pair of 1Gb Ethernet cables in each node, at best, and they're using those for both regular networking and iSCSI storage connectivity. (The fact that you're using EqualLogic is a clue - I've seen a lot of those with 1Gb implementations.)

If you have any networking problems at all:

The two nodes won't be able to see each other
Neither node will be able to see the storage
Presto, no one sees a majority, and you lose quorum
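To see how quorum is configured and which nodes currently count toward it, a quick check along these lines can help (a sketch; run it on any node that is still up):

```powershell
Import-Module FailoverClusters

# Quorum model and the witness resource, if there is one.
Get-ClusterQuorum | Select-Object Cluster, QuorumResource, QuorumType

# Each node's state and whether it has a vote (NodeWeight 1 = votes, 0 = no vote).
Get-ClusterNode | Select-Object Name, State, NodeWeight
```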

Things that can cause this include:

Backup software (doing huge reads from disk while simultaneously saturating network)
Running a CHECKDB (again, huge reads from disk plus huge writes to TempDB, which can prevent the cluster heartbeats from getting through if you only have one network interface for both regular networking and storage)

To work around it:

Use separate network interfaces for regular networking and iSCSI (like a pair of ports, 1Gb at least, dedicated to iSCSI and nothing else)
Use faster network interfaces (like 10Gb instead of 1Gb)
Do less disk/network-intensive work (stop doing backups and CHECKDB, ha ha ho ho, but also ease up on the index rebuilds)
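To check which networks the cluster actually knows about and how it uses them, something like this is a reasonable starting point (a sketch; the output depends entirely on your environment):

```powershell
Import-Module FailoverClusters

# Networks the cluster has detected; Role shows whether a network is used for
# cluster communication, client access, both, or neither.
Get-ClusterNetwork | Select-Object Name, Role, Address, State

# The individual adapters on each node and the cluster network they belong to.
Get-ClusterNetworkInterface | Select-Object Node, Network, Name, State
```

If the iSCSI network shows up with a role that allows cluster communication, or there is only one network listed at all, heartbeats and storage traffic are likely sharing the same pipe.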
