SQL Server 2016 SP2 CU2 Availability Group Online but Cluster Offline

availability-groups, clustering, sql-server, sql-server-2016

Am I missing something basic here? My AVG is online and working, but the Cluster on which it resides is saying offline. I didn't think this was possible, so I either have a basic misunderstanding of how this should work, or I have something odd going on. My setup:

3-node multi-subnet AOAG: 2 SQL Servers in the local data center (synchronous commit), 1 SQL Server in DR (asynchronous commit).

Get-ClusterResource results:

Cluster IP Address:                Failed
Cluster IP Address XX.XX.XXX.XX    Failed  (2nd subnet node)
Cluster name:                      Offline
File Share Witness:                Online
Availability Group Name:           Online
Availability local IP:             Online
Availability 2nd subnet IP:        Offline
Availability Group name:           Online

The databases are all synced and online. Availability group works fine. I get a lot of strange crashes that may or may not be related (CHECKDB often causes SQL Server to crash, even though this is a static test cluster with 8 CPUs and 128 GB of memory), but they don't seem related to this. I'm more baffled by the cluster being offline while the AVG is online. I have this same setup in prod (the only difference is storage), and there, as expected, the Cluster name and local IP are online.

Best Answer

but the Cluster on which it resides is saying offline.

It's not offline; otherwise clussvc would be stopped and you'd have a spew of errors in your errorlog saying that the AG shut down due to loss of cluster services and that it's waiting for the cluster service to start up before proceeding.

I believe you're basing the above quoted conclusion on:

Cluster IP Address: Failed

Cluster IP Address XX.XX.XXX.XX Failed (2nd subnet node)

Cluster name: Offline

This is just letting you know that the cluster name and its associated IPs are not online. These are part of the core cluster resources, but their failure will not stop the cluster itself or most of the services on it. In fact, if you try to connect to the cluster remotely using the cluster name (for example with the RSAT tools), it should fail, but that only affects name resolution and the administrative endpoint. It won't impact cluster resources in a different resource group unless for some reason they depend on that name (I couldn't fathom why they would, though).
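To see the split between the core cluster resources and the AG's own resource group, something like the following can help. It is only a sketch: it assumes an elevated PowerShell session on one of the cluster nodes with the FailoverClusters module installed, and it deliberately omits -Cluster so the cmdlets talk to the local cluster rather than going through the cluster name that is offline.

```powershell
# Assumption: run locally on a cluster node; no -Cluster parameter, so the
# offline cluster name is never used for the connection.
Import-Module FailoverClusters

# List every resource next to the group (role) that owns it. The failed
# name/IP resources should show up under the core group (named
# "Cluster Group" by default), separate from the AG's group.
Get-ClusterResource |
    Sort-Object OwnerGroup, Name |
    Format-Table OwnerGroup, Name, ResourceType, State -AutoSize

# Show the state and current owner node of each group as a whole.
Get-ClusterGroup | Format-Table Name, OwnerNode, State -AutoSize
```

If the AG's group shows as online while only the core group has failed members, that matches the behavior described above.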

The databases are all synced and online. Availability group works fine.

Yes, this is because the cluster is running, though you have some failed resources that should probably be taken care of sooner rather than later.

What should I be checking for basic Cluster running (other than AVG is running)?

This is going into the realm of Windows, and while I have no problem talking about it, I'm not sure this is the right place for it, so I'll give you a small primer of items to check with WSFC (Windows Server Failover Clustering):

  • Is the cluster service ("clussvc") running?
  • Is the witness online?
  • Can you remotely connect to the cluster? (Get-Cluster MyCluster)
  • Are any resources in a state other than Online? (Get-ClusterResource -Cluster MyCluster)
  • What are the current votes in the cluster? (Get-ClusterNode -Cluster MyCluster | Select Name, NodeWeight, DynamicWeight [WS2012+])
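The checks above could be strung together roughly like this. Treat it as a sketch under a few assumptions: it is run in an elevated session on a cluster node with the FailoverClusters module available, and it queries the local cluster (no -Cluster parameter), since connecting by cluster name is expected to fail while the name resource is offline. Get-ClusterQuorum is my addition for the witness check; it is not mentioned in the list itself.

```powershell
# Assumption: elevated session on a cluster node, local cluster queried.
Import-Module FailoverClusters

# 1. Is the cluster service running on this node?
Get-Service -Name ClusSvc | Select-Object Name, Status

# 2. Which quorum resource (witness) is configured?
Get-ClusterQuorum

# 3. Does the cluster respond at all? (local connection, not by name)
Get-Cluster

# 4. Any resources that are not Online, with the group that owns them.
Get-ClusterResource |
    Where-Object { $_.State -ne 'Online' } |
    Format-Table Name, State, OwnerGroup, ResourceType -AutoSize

# 5. Node state and current votes.
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight -AutoSize
```

To test the remote path as well, the same cmdlets can be run from another machine with -Cluster MyCluster (a placeholder name); a failure there while the local run succeeds points at the cluster name and IP resources rather than at the cluster service.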

Note that you don't have to use PowerShell, but it's one of the easiest and fastest ways to get information about cluster resources without writing your own calls against the clustering API.

Aside: I wouldn't call Availability Groups AVGs; the typical nomenclature is AG or AGs.