SQL Server – Availability Groups automatic failover is not so automatic

availability-groups, clustering, failover, sql-server, sql-server-2016

Behavior when I stop the SQL Server service:

  • The availability group fails (the secondary goes to RESOLVING and the AG resource shows a failed state in Failover Cluster Manager).

  • SQL Server no longer connects through the cluster IP. (I use a fixed cluster IP in the application so it can reach either node: connecting to x.x.x.10 should route me to either x.x.x.9 (SQL1) or x.x.x.8 (SQL2), thanks to the Windows failover cluster.)

  • I have to move the AG to the secondary manually (connect to SQL2 in SSMS, right-click the AG > Failover). The T-SQL equivalent is sketched after this list.

  • To put everything back the way it was, I have to start the SQL1 service manually, connect to SQL1 in SSMS and fail over manually, then stop the NODE2 service in Cluster Manager so the SQL1 node becomes primary again.
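
For reference, this is roughly what that right-click failover does in T-SQL; a minimal sketch, assuming an availability group named [MyAG] (substitute the real name), run on the replica that should become the new primary:

    -- Run on the secondary (SQL2) that should take over as primary.
    -- [MyAG] is a placeholder for the availability group's actual name.
    ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

    -- If the secondary is not synchronized, a planned failover is refused and
    -- the only option is a forced failover, at the risk of data loss:
    -- ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;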

Behavior when I stop the primary cluster node:

  • The availability group moves to SQL2/NODE2 (the secondary is now primary).

  • This time the AG does not get stuck in RESOLVING.

  • SQL Server still connects through the cluster IP (in SSMS, connecting to x.x.x.10 now lands me on x.x.x.8, the secondary node, after killing node x.x.x.9, the primary).

  • When I start the original primary node again, the AG does not automatically return to SQL1 (primary); connecting to x.x.x.10 still lands me on the secondary node.

  • If I want SQL1 to be primary again, I have to stop the NODE2 cluster service and start it again; only then does connecting through the cluster IP take me to SQL1 as the primary (see the query sketch after this list).
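
One way to confirm which replica currently holds the primary role, instead of inferring it from where the cluster IP lands, is a quick DMV query; a generic sketch (nothing here is specific to this setup):

    -- Role and state of each replica as seen from the instance you run this on
    SELECT ag.name AS ag_name,
           ar.replica_server_name,
           ars.role_desc,                -- PRIMARY / SECONDARY / RESOLVING
           ars.operational_state_desc,
           ars.connected_state_desc
    FROM sys.dm_hadr_availability_replica_states AS ars
    JOIN sys.availability_replicas AS ar ON ars.replica_id = ar.replica_id
    JOIN sys.availability_groups AS ag ON ars.group_id = ag.group_id;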

This is not the automatic behavior I expected. I thought that if the SQL Server service died, all connections would go to the secondary automatically, but failover only happens when the primary NODE itself is down (not when only the SQL Server service is). And to get back to the original primary, I have to STOP the SQL2 cluster service so that the primary becomes primary again.

Is there something that I'm missing?

We are not using a Failover Cluster with shared storage. There are 2 servers with their own disks.

Best Answer

You have a quorum issue in this setup. There are only two participants in the cluster:

  • SQL1
  • SQL2

If SQL1 goes down, there's no way for SQL2 to know that it should be the primary. For all SQL2 knows, the network has gone down and SQL1 is still operating as the primary. Thus, SQL2 goes to the RESOLVING state (it's not sure what to do). This is to prevent a "split brain" scenario.
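
If you want to see which quorum model the cluster is actually running with, SQL Server surfaces it through a DMV; a quick check you can run on either replica:

    -- Quorum model and state as reported by the WSFC to SQL Server
    SELECT cluster_name,
           quorum_type_desc,    -- e.g. NODE_MAJORITY vs NODE_AND_FILE_SHARE_MAJORITY
           quorum_state_desc
    FROM sys.dm_hadr_cluster;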

For a two-node cluster, generally you need to add a File Share Witness as a resource in the WSFC cluster. This will allow the remaining instance, SQL2, to establish quorum with the FSW, and take over as the primary.
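
The witness itself is configured at the WSFC level (Failover Cluster Manager or the cluster PowerShell cmdlets), but once it is in place you can confirm from SQL Server that it is counted as a voting member:

    -- Cluster members (nodes plus the file share witness) and their quorum votes
    SELECT member_name,
           member_type_desc,        -- CLUSTER_NODE or FILE_SHARE_WITNESS
           member_state_desc,
           number_of_quorum_votes
    FROM sys.dm_hadr_cluster_members;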

Additionally, AGs do not automatically fail back to the "original primary" as soon as it's back up. Automatic failover only covers an outage on the current primary; it's not intended to "automatically fail back."
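
When you do want SQL1 to be primary again, the supported route is a planned manual failover initiated from SQL1 once it has rejoined as a synchronized secondary; a minimal sketch, again assuming the AG is named [MyAG]:

    -- Run on SQL1 after it is back online and its databases show as SYNCHRONIZED
    ALTER AVAILABILITY GROUP [MyAG] FAILOVER;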


Note that the other scenario you described, stopping the primary cluster node, is not how you want your AGs to fail over. Don't do that:

Do not use the Failover Cluster Manager to manipulate availability groups, for example:

  • Do not add or remove resources in the clustered service (resource group) for the availability group.

  • Do not change any availability group properties, such as the possible owners and preferred owners. These properties are set automatically by the availability group.

  • Do not use the Failover Cluster Manager to move availability groups to different nodes or to fail over availability groups. The Failover Cluster Manager is not aware of the synchronization status of the availability replicas, and doing so can lead to extended downtime. You must use Transact-SQL or SQL Server Management Studio.
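
Unlike the Failover Cluster Manager, you can check that synchronization status yourself before issuing a failover from T-SQL or SSMS; a generic sketch to run on the intended failover target:

    -- Per-database synchronization state; a planned (no-data-loss) failover
    -- expects SYNCHRONIZED databases on the target replica
    SELECT ar.replica_server_name,
           DB_NAME(drs.database_id) AS database_name,
           drs.synchronization_state_desc,   -- SYNCHRONIZED / SYNCHRONIZING / NOT SYNCHRONIZING
           drs.synchronization_health_desc
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar ON drs.replica_id = ar.replica_id;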