Sql-server – SQL Server Always On File Share Witness (Quorum vote) on different subnet to other nodes

availability-groups, clustering, sql-server

I am currently having issues with some of my Availability Groups where node 1 and node 2 lose connectivity with each other: "A connection timeout has occurred on a previously established connection to availability replica".

The errors in Failover Cluster Manager say "File share witness resource 'File Share Witness' failed to arbitrate for the file share". The server the file share sits on hasn't restarted or had any issues, and all permissions are working.

The only thing I can see is that the file share server is on a different subnet from the other two SQL Server nodes in the cluster.

Could someone confirm whether having the file share server on a different subnet is a big no-no in an Always On environment? All the firewall rules are in place, as it can talk to the other nodes, but (usually) out of hours it loses connectivity.

The other weird thing is that there are 3 votes in the quorum including the file share, so even if the file share loses connectivity to the failover cluster, node 1 and node 2 shouldn't lose connectivity with each other, as there are still enough votes for quorum (2).

Best Answer

"Could someone confirm whether having the file share server on a different subnet is a big no-no in an Always On environment?"

It's completely fine to have the file share witness (FSW) on a different subnet; there is absolutely nothing wrong with that. There is no need to have it on the same subnet. In fact, the Cloud Witness option stores its witness in Azure, which definitely won't be on the same subnet, and it works without issue.

"A connection timeout has occurred on a previously established connection to availability replica"

This points either to a problem in the network or, if this is running on virtual machines, to something happening at the guest or host level. There are a plethora of in-depth configuration settings at the host, guest, and OS level that can contribute to this, so I won't go into further depth, as it would be out of scope for this site.

"The errors in Failover Cluster Manager say "File share witness resource 'File Share Witness' failed to arbitrate for the file share". The server the file share sits on hasn't restarted or had any issues, and all permissions are working."

This means that whichever node attempted to arbitrate for the witness was one vote short of having quorum for the cluster. Since it's a two-node cluster, the nodes would be in exactly this situation if they couldn't talk to each other.
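The vote arithmetic behind that statement can be sketched in a few lines of Python. This is a simplified model, not how the cluster service is implemented: it assumes a static total of three votes (two nodes plus the witness) and ignores dynamic quorum and dynamic witness adjustments. The names `has_quorum` and `TOTAL_VOTES` are made up for illustration.

```python
# Simplified quorum model: a partition keeps the cluster running only if it
# holds a strict majority of all configured votes.
# Assumption: static votes, no dynamic quorum or dynamic witness.

TOTAL_VOTES = 3  # node 1, node 2, file share witness


def has_quorum(votes_held: int, total_votes: int = TOTAL_VOTES) -> bool:
    """Return True if votes_held is a strict majority of total_votes."""
    return votes_held > total_votes // 2


# Witness unreachable, but the nodes still see each other: 2 of 3 votes.
print("nodes together, no witness:", has_quorum(2))

# Nodes partitioned: each node holds only its own vote, 1 of 3.
print("isolated node, no witness: ", has_quorum(1))

# An isolated node that wins arbitration for the witness gains its vote.
print("isolated node plus witness:", has_quorum(1 + 1))
```

In this model a node only needs the witness when it can no longer count the other node's vote, which is why a failed arbitration suggests the nodes were already partitioned from each other.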

If the nodes can't talk to each other (obviously an issue) and neither node can reach the FSW (another issue), that makes me wonder what's broken in the infrastructure, either at the virtual layer or at the physical (network) layer. Clearly something is causing this, and it is specific to your environment, not to SQL Server.
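One way to gather evidence is to record, from each node, whether the witness host is reachable over time. The following is a minimal sketch, assuming the witness share is served over SMB on TCP port 445; `tcp_reachable` and `probe_once` are hypothetical helper names, and the host name you pass in is a placeholder for your own server. A successful TCP connect only shows that the port is reachable, not that the share or its permissions are healthy.

```python
import socket
from datetime import datetime, timezone


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_once(host: str, port: int = 445, timeout: float = 3.0) -> str:
    """Return one timestamped log line describing reachability of host:port."""
    status = "OK" if tcp_reachable(host, port, timeout) else "FAILED"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"{stamp} {host}:{port} {status}"
```

Calling `probe_once` from a scheduled task on each node every few seconds and appending the result to a file would give you timestamps to compare against the cluster events.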

"The other weird thing is that there are 3 votes in the quorum including the file share, so even if the file share loses connectivity to the failover cluster, node 1 and node 2 shouldn't lose connectivity with each other, as there are still enough votes for quorum (2)."

Yes; however, I'm betting the nodes lost connectivity to each other. There are probably entries in the cluster log about missed heartbeats, connectivity on port 3343, regroups, and so on.

Connectivity isn't about votes; it's about health checks. Once health checks fail, the nodes become partitioned, and that's when these events occur. You'll need to find out what was happening in your environment around the time of each incident. If it happens often and on a schedule, it's probably some task or software in the environment. If it happens randomly, it's most likely an infrastructure issue such as networking, or host/guest/OS settings if it happens under load.
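A quick way to tell "on a schedule" from "random" is to bucket the incident times by hour of day. The sketch below assumes you have already collected the incident timestamps as ISO 8601 strings (for example from the cluster events); the values in `sample` are invented purely for illustration, and `incidents_by_hour` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime


def incidents_by_hour(timestamps):
    """Count incidents per hour of day (0-23) from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


# Made-up example timestamps; replace with the times of your own incidents.
sample = [
    "2024-01-08T02:01:13",
    "2024-01-09T02:00:47",
    "2024-01-10T02:02:05",
    "2024-01-12T14:37:20",
]

by_hour = incidents_by_hour(sample)
for hour, count in sorted(by_hour.items()):
    print(f"{hour:02d}:00  {'#' * count}")
```

If most incidents fall into the same hour, look for backups, snapshots, antivirus scans, or other scheduled jobs that run at that time. A flat spread points more toward the network or the virtualization layer.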