Behavior when I stop the SQL Server service:

- The availability group fails (RESOLVING on the secondary SQL Server, and a failed state in Cluster Manager).
- SQL Server does not connect with the cluster IP. (I'm using a fixed cluster IP inside the software so it can connect to both SQL1 and SQL2: connecting to x.x.x.10 should reach either x.x.x.9 (SQL1) or x.x.x.8 (SQL2), because of the Windows Failover Cluster.)
- I need to manually move the AG to the secondary (I connect to SQL2 and right click > failover).
- To return everything to how it was, I need to start the SQL1 service manually, connect to SQL1 in SSMS, and right click > failover manually. Then I stop the NODE2 service in Cluster Manager so the SQL1 node becomes primary again.
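The right click > failover step in SSMS corresponds to a T-SQL command. A minimal sketch, assuming the availability group is named [MyAG] (substitute your own AG name), run while connected to the secondary replica (SQL2):

```sql
-- Run on the SECONDARY replica that should become primary.
-- Planned manual failover; requires the secondary to be SYNCHRONIZED:
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

-- If the old primary is unreachable and data loss is acceptable,
-- a forced failover is the alternative:
-- ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```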
Behavior when I stop the primary cluster node:

- The availability group goes to SQL2/NODE2 (the secondary is now primary).
- The primary AG is not RESOLVING…
- SQL Server connects with the cluster IP (in SSMS, connecting to x.x.x.10 connects me to x.x.x.8, the secondary node, after killing the primary node x.x.x.9).
- When I start the primary cluster node again, the AG doesn't automatically return to SQL1 (primary); connecting to x.x.x.10 still connects me to the secondary node.
- If I want to make SQL1 primary again, I need to **stop** the NODE2 cluster service and start it again. This way, SQL1 (primary) is primary again when connecting with the cluster IP.
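Instead of watching this through the Cluster Manager GUI, the node and role states can be inspected with the FailoverClusters PowerShell module; a sketch, assuming it is run on one of the cluster nodes:

```powershell
# Requires the FailoverClusters module (installed with the WSFC feature).
Import-Module FailoverClusters

# Which nodes are currently Up or Down:
Get-ClusterNode | Format-Table Name, State

# Which node currently owns each cluster role (including the AG role):
Get-ClusterGroup | Format-Table Name, OwnerNode, State
```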
This is not the automatic behavior I thought I would have. I thought that if SQL Server died, all connections would go to the secondary automatically, but it only fails over if the primary NODE is down (not the SQL Server service). And then, to return to the primary, I need to STOP the SQL2 cluster service so the primary becomes primary again.
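Whether an AG can fail over automatically also depends on how each replica is configured. One way to check the current settings from either replica (the column names below are from the standard catalog view):

```sql
-- Automatic failover requires SYNCHRONOUS_COMMIT plus
-- FAILOVER_MODE = AUTOMATIC on both failover partners.
SELECT replica_server_name,
       availability_mode_desc,  -- SYNCHRONOUS_COMMIT / ASYNCHRONOUS_COMMIT
       failover_mode_desc       -- AUTOMATIC / MANUAL
FROM sys.availability_replicas;
```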
Is there something that I'm missing?
We are not using a Failover Cluster with shared storage. There are 2 servers with their own disks.
Best Answer
You have a quorum issue in this setup. There are only two voting participants in the cluster: SQL1 and SQL2.
If SQL1 goes down, there's no way for SQL2 to know that it should be the primary. For all SQL2 knows, the network has gone down and SQL1 is still operating as the primary. Thus, SQL2 goes to the RESOLVING state (it's not sure what to do). This is to prevent a "split brain" scenario.
For a two-node cluster, generally you need to add a File Share Witness as a resource in the WSFC cluster. This will allow the remaining instance, SQL2, to establish quorum with the FSW, and take over as the primary.
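The File Share Witness can be added from PowerShell as well as from the Failover Cluster Manager wizard; a sketch, assuming \\FILESERVER\ClusterWitness is a share on a third machine (not on SQL1 or SQL2) that the cluster's computer account can write to:

```powershell
# Run on either cluster node; requires the FailoverClusters module.
Import-Module FailoverClusters

# Switch the quorum model to Node and File Share Majority,
# using a share that lives on neither SQL node:
Set-ClusterQuorum -NodeAndFileShareMajority "\\FILESERVER\ClusterWitness"

# Verify the resulting quorum configuration:
Get-ClusterQuorum
```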
Additionally, the behavior of AGs is not to automatically fail back to the "original primary" as soon as it's back up. Automatic failover is only in the case of an outage on the current primary, it's not intended to "automatically fail back."
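If automatic failover (though still not automatic failback) is the goal once quorum is fixed, both replicas must be synchronous-commit with automatic failover mode; a sketch, again assuming the AG is named [MyAG] and the replicas are named SQL1 and SQL2, run on the current primary:

```sql
-- Set synchronous commit first; AUTOMATIC failover mode requires it.
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL1' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL1' WITH (FAILOVER_MODE = AUTOMATIC);
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL2' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL2' WITH (FAILOVER_MODE = AUTOMATIC);
```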
Note that the other scenario you described, stopping the primary cluster node, is not how you want your AGs to fail over. Don't do that: