I want to point out something. When a failover occurs, any client using the listener to connect will have its connection closed. This happens whether MultiSubnetFailover is set or not. What MultiSubnetFailover does is have the client driver spin up multiple connection attempts (one for each IP) without the application knowing, connecting to the currently active IP transparently to the app.
Having said that, since this setting isn't required or needed on a single subnet (though best practice is still to put it in the connection string, as you don't know whether it may be needed later), there isn't anything that needs to be done.
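For example, in a .NET application it's just one more keyword in the connection string. Here's a minimal sketch using the ADO.NET connection string builder; the listener and database names are placeholders:

```csharp
using System.Data.SqlClient;

class ListenerConnect
{
    static void Main()
    {
        // "agListener" and "MyDatabase" are placeholder names for your environment.
        var builder = new SqlConnectionStringBuilder
        {
            DataSource = "agListener",
            InitialCatalog = "MyDatabase",
            IntegratedSecurity = true,
            MultiSubnetFailover = true, // parallel attempts across listener IPs
            ConnectTimeout = 30
        };

        using (var conn = new SqlConnection(builder.ConnectionString))
        {
            conn.Open(); // the driver races the listener IPs and keeps the winner
        }
    }
}
```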
Is there a way to detect such failures in the application and have it retry?
Yes! It's the same error as if you were disconnected for any other reason. Depending on the driver used, the error checking may be different. For example, if you're using .NET and the ADO.NET SqlClient, you can use the StateChange event to know when you connect or become disconnected. If your delegate is fired, check the connection. If you are disconnected, attempt to connect again, and so on.
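A minimal sketch of that pattern is below; the retry count and backoff are illustrative rather than a recommendation, and note that ADO.NET generally raises the event when an operation discovers the connection is dead, not the instant the network drops:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Threading;

class ResilientConnection
{
    // Placeholder connection string - point it at your AG listener.
    const string ConnStr =
        "Server=agListener;Database=MyDatabase;Integrated Security=true;MultiSubnetFailover=True;";

    static SqlConnection Connect()
    {
        var conn = new SqlConnection(ConnStr);
        conn.StateChange += OnStateChange; // fires on open/close transitions
        conn.Open();
        return conn;
    }

    static void OnStateChange(object sender, StateChangeEventArgs e)
    {
        // Only react when we fall out of the Open state.
        if (e.OriginalState != ConnectionState.Open ||
            e.CurrentState == ConnectionState.Open)
            return;

        var conn = (SqlConnection)sender;
        for (int attempt = 1; attempt <= 5; attempt++)
        {
            try
            {
                if (conn.State != ConnectionState.Closed)
                    conn.Close(); // a Broken connection must be closed before reopening
                conn.Open();
                return; // reconnected, likely to the new primary
            }
            catch (SqlException)
            {
                Thread.Sleep(1000 * attempt); // simple linear backoff
            }
        }
        throw new InvalidOperationException("Could not reconnect after failover.");
    }
}
```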
My question is this: should we change the default settings in the Failover Cluster Manager, specifically the Preferred Owners of the AG role and the Possible Owners of the AG listener IP resources?
NEVER check or uncheck the preferred or possible owners for AG resources. In SQL Server 2016, the preferred owners list can be reordered to control which failover target is preferred (given the change that allows multiple automatic failover targets), but do not check or uncheck any boxes, ever, with Availability Groups. Period.
If you check and uncheck things like this, your AG isn't going to work properly. When I teach the Availability Group class, I always show how this works, why it works, and why we NEVER want to do this (spoiler alert: the AG will most likely fail, hard).
Let me explain this, though, to make it a little clearer.
Preferred and Possible Owners
These values are set by SQL Server based on the settings for the Availability Group. The cluster has its own copy of the metadata of how things should work and SQL Server also has a copy of how it believes things should work.
This is great most of the time, when no one changes anything and both SQL Server and the cluster continue to hum along.
However, SQL Server knows what it believes should and shouldn't own the resources based on the AG settings, and it enforces this by calling cluster APIs to set the expected behavior. You'll see this reflected in the HADR_CLUSAPI_CALL wait type.
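If you want to see evidence of those calls, the wait type shows up in the wait stats DMV. A quick sketch (the connection string is a placeholder, and the query needs VIEW SERVER STATE permission):

```csharp
using System;
using System.Data.SqlClient;

class WaitStatsCheck
{
    static void Main()
    {
        // Placeholder - connect to any replica you want to inspect.
        const string connStr = "Server=SQL01;Database=master;Integrated Security=true;";

        const string query = @"
            SELECT waiting_tasks_count, wait_time_ms
            FROM sys.dm_os_wait_stats
            WHERE wait_type = N'HADR_CLUSAPI_CALL';";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(query, conn))
        {
            conn.Open();
            using (var rdr = cmd.ExecuteReader())
            {
                while (rdr.Read())
                    Console.WriteLine($"cluster API calls: {rdr.GetInt64(0)}, total wait ms: {rdr.GetInt64(1)}");
            }
        }
    }
}
```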
Three Async Replicas
Let's look at SQL Server with an AG set up so that there are three async commit replicas. If we do this, our AG and role resource should look like this:
Remember, async commit replicas cannot have automatic failover. Thus, this makes sense; our current role owner is the only possible and preferred owner. Since SQL Server is set up so that the only way to fail over is by forcing it, we don't want any other replica to pick this up and start running with it, so that gets enforced at the clustering level.
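You can confirm what SQL Server believes from the AG metadata itself; if every replica reports ASYNCHRONOUS_COMMIT and MANUAL, the cluster-side owner settings above are exactly what you'd expect. A sketch, with the AG name as a placeholder:

```csharp
using System;
using System.Data.SqlClient;

class ReplicaModes
{
    static void Main()
    {
        // Placeholder - connect to any replica in the AG.
        const string connStr = "Server=SQL01;Database=master;Integrated Security=true;";

        const string query = @"
            SELECT ar.replica_server_name,
                   ar.availability_mode_desc, -- SYNCHRONOUS_COMMIT / ASYNCHRONOUS_COMMIT
                   ar.failover_mode_desc      -- AUTOMATIC / MANUAL
            FROM sys.availability_replicas AS ar
            JOIN sys.availability_groups AS ag ON ag.group_id = ar.group_id
            WHERE ag.name = @agName;";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(query, conn))
        {
            cmd.Parameters.AddWithValue("@agName", "MyAg"); // placeholder AG name
            conn.Open();
            using (var rdr = cmd.ExecuteReader())
            {
                while (rdr.Read())
                    Console.WriteLine($"{rdr.GetString(0)}: {rdr.GetString(1)} / {rdr.GetString(2)}");
            }
        }
    }
}
```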
Two Synchronous Manual Failover, One Async - Demo
The results will be the same as the Three Async demo. Why?
This is the same explanation as the three async replicas: we have no automatic failover yet, so we don't want anything AUTOMATICALLY taking over. Thus, the settings are the same.
Two Synchronous Automatic Failover, One Async - Demo
Here is where things change and get interesting! Since we're now introducing automatic failover - and that's one of the roles of WSFC (along with health checking, metadata distribution, etc.) - we're going to want to have preferred and possible owners.
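And that's the key point: if you want different preferred and possible owners, you change the AG in SQL Server and let it push the change down through those cluster API calls, rather than ever touching the cluster resources yourself. A sketch of flipping a replica to synchronous with automatic failover (the server, AG, and replica names are placeholders; this must run against the primary):

```csharp
using System.Data.SqlClient;

class EnableAutomaticFailover
{
    static void Main()
    {
        // Placeholder - must connect to the current primary replica.
        const string connStr = "Server=PrimaryReplica;Database=master;Integrated Security=true;";

        // Automatic failover requires synchronous commit, so set both.
        // SQL Server then updates the cluster's preferred/possible owners itself.
        const string sql = @"
            ALTER AVAILABILITY GROUP [MyAg]
            MODIFY REPLICA ON N'SecondaryReplica'
            WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

            ALTER AVAILABILITY GROUP [MyAg]
            MODIFY REPLICA ON N'SecondaryReplica'
            WITH (FAILOVER_MODE = AUTOMATIC);";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```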
Other Responses
We recently had a 5 minute outage when we replaced our CO-SQL01 server hardware ...
You've specifically removed the vote from CO and had only one other voter left, then had an unexpected crash. It sounds like it was working as it should.
To be honest we have had a number of unexpected issues with using Multi-Subnet Failover Cluster Availability Groups, and it seems like the default preferred role owners and possible resource owners may be incorrect or at least not optimal for our scenario.
I'm going to be brutally honest here. Availability Groups are not a silver bullet that fixes everything HA & DR. It seems that the technology is being used, but not much is known about how it works or why it does these things at either the AG level or the WSFC level, so I completely agree it may not be set up properly... however, it is working as it is set up - I can't fault the cluster for doing what it is supposed to be doing.
We are currently looking at using the new Distributed Availability Groups feature in SQL Server 2016 to split our Multi-Subnet AG into two single-subnet AGs (one for each datacenter) as a way to prevent these issues in the future.
If you're having this many issues with Availability Groups I would not foray into Distributed Availability Groups on your own. It may or may not fit your use cases - but if I were you, I'd bring in someone who helps architect these types of solutions and have your use cases ready. Otherwise, you're going to make another post like this.
We also think that will allow us to upgrade the cluster OS with minimal downtime.
One of the use cases for Distributed Availability Groups is extremely low-downtime cross-cluster migrations, so you're correct. If you're using Windows Server 2012 R2, you can do rolling cluster upgrades [NOT available in 2012, must be R2] without needing Distributed Availability Groups.
And here are the default settings for the SQL-StackOverflow_AG IP resources
Yes, do not touch them :) Leave them as is.
Best Answer
When MultiSubnetFailover=True is set, SQL Server Native Client will aggressively retry the TCP connection. So in:
- a multi-subnet scenario, it will attempt connections in parallel, which you already know.
- a single-subnet scenario, it will aggressively retry the TCP connection faster than the OS default TCP retransmit interval, so your reconnect time after an AG failover will be improved.