Is it safe to run a Windows Failover Cluster Public and Heartbeat on a Single NIC?

clustering, sql-server, sql-server-2016, windows-server

I have a 2-Node Windows Failover Cluster (Windows Server 2016, SQL Server 2017) with a Quorum Disk Witness on fibre-connected shared storage.

How important is it that the heartbeat is on its own network with this Quorum setup and why?

Am I right in thinking that if the network goes down, the Quorum Disk Witness will stop the nodes trying to failover, as it will still vote that both nodes are online, even though the nodes themselves are saying the other is offline?

In this scenario I can't see what benefit there is to having the heartbeat on its own network. I completely get that teaming two NICs gives resiliency, but separating the heartbeat (as far as I can see) doesn't offer much with this quorum setup.

Best Answer

How important is it that the heartbeat is on its own network with this Quorum setup and why?

What you have asked is debatable, and I should add that the quorum configuration has little bearing on how the cluster networks are configured. From Windows Server 2008 onward, Microsoft says you can configure WSFC without a dedicated heartbeat network. If you have not configured one, the cluster validation wizard only gives you a warning, which means your cluster is still supported. That does not mean it is all good, though. Allow me to quote a reason for a dedicated NIC for the heartbeat (Source):

Heartbeat communication is used for the Health monitoring between the nodes to detect node failures. Heartbeat packets are Lightweight (134 bytes) in nature and sensitive to latency. If the cluster heartbeats are delayed by a Saturated NIC, blocked due to firewalls, etc, it could cause the cluster node to be removed from Cluster membership. Intra-Cluster communication is executed to update the cluster database across all the nodes any cluster state changes. Clustering is a distributed synchronous system. Latency in this network could slow down cluster state changes.

Reading the above should give you a fair idea of why a dedicated heartbeat network might still be important.
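To make the "delayed heartbeats can remove a node" point concrete, here is a minimal sketch of a consecutive-miss rule. It is not WSFC's actual algorithm. The function names and the threshold of 5 are hypothetical, chosen only for illustration (the real cluster exposes its own delay and threshold settings, which I am not reproducing here).

```python
from typing import Iterable


def longest_miss_streak(received: Iterable[bool]) -> int:
    """Return the longest run of consecutive missed heartbeats."""
    longest = current = 0
    for ok in received:
        # Reset the streak on a received heartbeat, extend it on a miss.
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def would_be_removed(received: Iterable[bool], threshold: int) -> bool:
    """Hypothetical rule: a node is dropped after `threshold` misses in a row."""
    return longest_miss_streak(received) >= threshold


THRESHOLD = 5  # assumed value for illustration, not a WSFC default

# True = heartbeat arrived in time, False = missed or too late.
brief_blip = [True, False, False, True, True, True, True, True]
saturated_nic = [True, False, False, False, False, False, True, True]

print(would_be_removed(brief_blip, THRESHOLD))     # False
print(would_be_removed(saturated_nic, THRESHOLD))  # True
```

The point is only that a short blip is tolerated while a sustained stall on a busy shared NIC is not, which is the risk the quote describes.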

Since you are on Windows Server 2016, you can add a heartbeat network without worrying about network binding order, the order that tells Windows which network/route should be given priority. By default, Windows Server 2016 uses the Interface Metric property of a network adapter to determine which route has the highest priority. The lower the Interface Metric value, the higher the priority. More information is in this support article.
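As a rough illustration of "lower metric wins", the selection rule can be sketched like this. The `Adapter` class, the adapter names and the metric values are all made up for the example and are not read from any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Adapter:
    name: str              # hypothetical adapter label
    interface_metric: int  # lower value means higher priority


def preferred_adapter(adapters: List[Adapter]) -> Adapter:
    """Pick the adapter with the lowest Interface Metric."""
    return min(adapters, key=lambda a: a.interface_metric)


# Made-up values purely for illustration.
adapters = [Adapter("NIC-A", 25), Adapter("NIC-B", 5), Adapter("NIC-C", 15)]
print(preferred_adapter(adapters).name)  # NIC-B
```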

I also believe it is not much overhead to configure a heartbeat network. If the heartbeat network goes down, WSFC will start using the public network for cluster communication, so cluster communication carries on. I see it mostly as a way of segregating traffic and making cluster communication more secure. BUT if your public network is well teamed and has enough bandwidth to comfortably carry both cluster and client communication, by all means go ahead without a heartbeat network. Please note that if you only have a public network, all client and internal cluster communication goes over that link, so it has to be robust.

Here is what MVP and MCM Edwin Sarmiento has to say about the heartbeat network (Source):

But here's my take on it: even if we have multiple NICs per node that are teamed up, how sure are we that the network switches are redundant and highly available? I've seen DR exercises where only the servers are tested but not the underlying network architecture. Only when the network switches themselves fail do they realize that they are not at all highly available. I still recommend having a dedicated network for the heartbeat communication and if the customer can guarantee that the network layer - switches, routers, etc. - is highly available, then I'll be happy with a NIC teaming implementation (thus, my reason for having a dedicated network for the heartbeat.)

I believe he is correct: the focus should be on making the complete network redundant, not just part of it.

Am I right in thinking that if the network goes down, the Quorum Disk Witness will stop the nodes trying to failover, as it will still vote that both nodes are online, even though the nodes themselves are saying the other is offline?

If by network you mean that the complete public network is down, then it could be a single point of failure that brings down the whole WSFC. This is precisely what Edwin emphasized in the quote above. If instead you mean that a connectivity issue caused one of the nodes to be removed from cluster membership, forcing the cluster to recalculate quorum and fail over, then the surviving node and the quorum disk together still hold 2 of the 3 votes (> 50%), so the WSFC will remain online and do the failover. Such a network issue would not affect the disk/storage, as it is connected via the SAN, not via the cluster public network.
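The vote arithmetic for a 2-node cluster with a disk witness can be sketched as follows. This is only the majority rule written out, not how WSFC is implemented, and `has_quorum` is a name I made up for the example.

```python
def has_quorum(votes_held: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_held > total_votes / 2


# Two nodes plus one disk witness: three votes in total.
TOTAL_VOTES = 3

# Node that still reaches the witness disk over the SAN: own vote + witness.
survivor = has_quorum(votes_held=2, total_votes=TOTAL_VOTES)

# Node that was removed from membership and does not hold the witness.
isolated = has_quorum(votes_held=1, total_votes=TOTAL_VOTES)

print(survivor, isolated)  # True False
```

Without the witness there would be only two votes, and a single node holding one of them would not have a majority.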

Additional reading:

Windows Server 2008 networking 3 part series

Disclaimer: I must add that I am not a network engineer, and a detailed discussion of network configuration for WSFC is outside my knowledge. I believe a network engineer can definitely add more to this answer. I have tried to answer your question to the best of my knowledge. Hope this helps.
