Why does the secondary server not fail over?

availability-groups, clustering, high-availability

I hope someone can help. I will try to summarise as best I can.

We have four servers in a cluster with an AG set up with Database Health Detection enabled. Three servers are on site1, with the fourth on a separate DR site, site2. We are testing DR by cutting the link between the two sites.
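
For completeness, this is how I confirm that database-level health detection is on (assumes the SqlServer PowerShell module; the instance name is a placeholder):

```powershell
# db_failover = 1 means Database Health Detection is enabled for the AG.
Invoke-Sqlcmd -ServerInstance "SITE1SQL1" -Query @"
SELECT name, db_failover, failure_condition_level, health_check_timeout
FROM sys.availability_groups;
"@
```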

As expected, when the link is cut, the three servers on site1 stay up, while the server on site2 loses quorum and its database goes into recovery until the link is re-established.
If I perform a manual failover to site2 before we cut the link, the same thing happens, and I have to force quorum manually to bring the server back up on site2.
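
For reference, forcing quorum on the surviving node looks like this (node, instance, and AG names are placeholders):

```powershell
# Run on the site2 node: start the cluster service without a vote
# majority (the node comes up in a forced-quorum state).
Start-ClusterNode -Name "SITE2NODE" -ForceQuorum

# If the replica there then needs to come online as primary, the AG
# has to be failed over with possible data loss.
Invoke-Sqlcmd -ServerInstance "SITE2SQL1" -Query `
    "ALTER AVAILABILITY GROUP [AG1] FORCE_FAILOVER_ALLOW_DATA_LOSS;"
```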

This isn't a big deal, but I would prefer it to stay up, so I added an Azure cloud witness to the cluster. However, testing has given us unexpected results.
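
For context, the witness was added with the standard cmdlet (the storage account name and key are placeholders):

```powershell
# Configure an Azure cloud witness for the cluster (Windows Server 2016+).
Set-ClusterQuorum -CloudWitness -AccountName "mystorageaccount" `
    -AccessKey "<storage-account-access-key>"
```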

  1. With an Azure witness, if I shut down the 2 secondary servers on site1 before we cut the link (leaving just one server on each site), then cutting the link triggers an automatic failover to site2. This is great, but I want the other servers to be left up and running.
  2. If I leave the 3 servers on site1 up and we cut the link, the server on site2 loses its connection to the Azure witness (even though it still has internet access) and loses quorum. That I don't understand at all. (I track the vote state during these tests as in the sketch after this list.)
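
This is how I inspect the votes while testing (standard FailoverClusters cmdlets):

```powershell
# Show each node's configured vote (NodeWeight) and the vote the cluster
# is currently counting (DynamicWeight), plus the witness in use.
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight
Get-ClusterQuorum | Format-List Cluster, QuorumResource
```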

Wouldn't the server in site2 think that it was the last remaining server and fail over to become the primary?

Additionally, I've tried to find out exactly what events happen when a failover occurs.

  1. Does the primary tell the secondary to fail over, and if so, how does that work if the link is cut? Surely it can't.
  2. Or does the secondary realise that there is no handshake and assume it's now the primary? If so, wouldn't I end up with both sites having a primary server? (I check each side's view with the query sketched after this list.)
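
For what it's worth, each replica reports its own view of its role, so after a test I query both sides to see whether anyone believes there are two primaries (instance name is a placeholder; assumes the SqlServer module for Invoke-Sqlcmd):

```powershell
# Each replica reports its own view of role and connectivity; run this
# against each side after a test to see which server believes it is primary.
Invoke-Sqlcmd -ServerInstance "SITE2SQL1" -Query @"
SELECT ag.name, ar.replica_server_name, rs.role_desc, rs.connected_state_desc
FROM sys.dm_hadr_availability_replica_states rs
JOIN sys.availability_replicas ar ON rs.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON rs.group_id = ag.group_id;
"@
```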

I've tried searching for sources of information. Any help would be greatly appreciated.

Best Answer

Wouldn't the server in site2 think that it was the last remaining server and fail over to become the primary?

No, because it is in a partition with only two votes (its own plus the witness), which is two of the cluster's five votes and not a majority; three are needed. For all the DR side knows, there are three healthy nodes on the primary side which still have quorum.

When you gracefully shut down the servers on the primary side, Dynamic Quorum reconfigured the votes, which is why the one-node-per-site scenario could still fail over automatically.
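
You can verify this behaviour on the cluster itself; a minimal check, assuming default settings on Windows Server 2012 or later:

```powershell
# DynamicQuorum is enabled by default; it removes the vote of a node
# that is shut down cleanly, so the remaining votes can still form
# a majority.
(Get-Cluster).DynamicQuorum

# After the graceful shutdowns, DynamicWeight shows which votes the
# cluster is still counting.
Get-ClusterNode | Format-Table Name, State, DynamicWeight
```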

Quorum is explained here: https://docs.microsoft.com/en-us/windows-server/failover-clustering/manage-cluster-quorum