SQL Server – AlwaysOn Availability Groups go to Resolving status

availability-groups, sql-server, sql-server-2012

We have a two-node cluster set up running SQL Server 2012 on Windows Server 2012 R2. The base cluster consists of only those two nodes. The quorum is set up as node + disk majority, with a shared disk sitting on an EqualLogic array connecting via iSCSI. Cluster configuration validates with no errors.
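For what it's worth, the quorum configuration can also be inspected from inside SQL Server rather than only from Failover Cluster Manager; a query along these lines (a diagnostic sketch using the standard AlwaysOn DMVs, run on either node) shows the quorum type and each cluster member's vote:

```sql
-- Quorum type and state as SQL Server sees it
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Each cluster member (both nodes and the disk witness) with its vote count
SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
```

With node + disk majority on a two-node cluster, you'd expect three votes total, so two must remain visible for quorum to hold.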

Earlier today, the two Availability Groups running on the cluster (both primary on Node #1) went into a resolving state on both nodes. Looking at the cluster events, there is nothing until it tries to restart the service. Everything underneath the Windows Failover cluster shows green (online and no warnings) – disks, network interfaces, nodes, etc.

Looking at the application log in Windows Event Viewer, the events show that the Availability Groups entered the Resolving state because the cluster requested it after a quorum could not be established. I cannot find anything else in any logs to support this, and the quorum passes during cluster validation.
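For anyone digging into the same symptoms, the per-replica state and the last connection error are exposed through the AlwaysOn DMVs; something like this (a generic diagnostic sketch, not specific to our setup) helps narrow down why an AG dropped to Resolving:

```sql
-- Per-replica AlwaysOn health, including the last connection error,
-- which often explains why an AG went to RESOLVING
SELECT ar.replica_server_name,
       rs.role_desc,
       rs.operational_state_desc,
       rs.connected_state_desc,
       rs.synchronization_health_desc,
       rs.last_connect_error_description,
       rs.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states AS rs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = rs.replica_id;
```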

The AGs do not recover. We found in the AlwaysOn log in SQL Server logs that the first node had a mirroring endpoint failure. This occurred immediately following a memory access violation with symptoms similar to a problem Microsoft has resolved with Cumulative Update 6, so we're going to try that next.
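For reference, the endpoint state can be checked, and the endpoint restarted, without bouncing the whole service; a sketch, assuming the common default endpoint name Hadr_endpoint (yours may differ):

```sql
-- Verify the database mirroring (HADR) endpoint is STARTED and listening
SELECT e.name, e.state_desc, e.role_desc, t.port
FROM sys.database_mirroring_endpoints AS e
JOIN sys.tcp_endpoints AS t
  ON t.endpoint_id = e.endpoint_id;

-- If the endpoint shows STOPPED, restarting just the endpoint can be
-- less disruptive than restarting the SQL Server service.
-- [Hadr_endpoint] is the usual default name; substitute your own.
ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED;
```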

I was able to restart the SQL Server service on node 1 (the one with the memory access violations that the second node could not communicate with), and at that point I could bring the availability groups back online. A reboot also brings everything back to 100%.

Do you have any insight as to why that wouldn't have failed the AGs over from node 1 and kept the databases alive on node 2 when the mirroring endpoint failed? We're currently at failure condition level 3; would raising that to 4 trigger the failure detection on node 1 while leaving the service alive on node 2?
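For context, the flexible failover policy is set per availability group; a sketch of checking and raising it (the AG name [MyAg] is a placeholder) would look like:

```sql
-- Check the current failure condition level and health-check timeout
SELECT name, failure_condition_level, health_check_timeout
FROM sys.availability_groups;

-- Raise the policy to level 4 ("moderate server errors").
-- [MyAg] is a placeholder for your availability group name.
ALTER AVAILABILITY GROUP [MyAg]
SET (FAILURE_CONDITION_LEVEL = 4);
```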

Best Answer

I've seen this happen a lot when folks only have one set of network cables connecting their servers - meaning, at best, a pair of 1Gb Ethernet cables in each node, used for both regular networking and iSCSI storage connectivity. (The fact that you're using EqualLogic is a clue - I've seen a lot of those with 1Gb implementations.)

If you have any networking problems at all:

  • The two nodes won't be able to see each other
  • Neither node will be able to see the storage
  • Presto, no one sees a majority, and you lose quorum

Things that can cause this include:

  • Backup software (doing huge reads from disk while simultaneously saturating network)
  • Running a CHECKDB (again, huge reads from disk plus huge writes to TempDB, which can prevent the cluster heartbeats from getting through if you only have one network interface for both regular networking and storage)

To work around it:

  • Use separate network interfaces for regular networking and iSCSI (at minimum, a pair of 1Gb ports dedicated to iSCSI and nothing else)
  • Use faster network interfaces (like 10Gb instead of 1Gb)
  • Do less disk/network-intensive work (stop doing backups and CHECKDB, ha ha ho ho, but also ease up on the index rebuilds)