AlwaysOn Auto failover. Make it faster.

availability-groupsperformance

I cause a 2016 Server Standard Database to AlwaysOn Auto failover (its a test DB single table), I have to wait up to 60 seconds to connect again via the listenser to the standby.

No data is lost. But 100% up time is not.

What can I tweak to speed it up?

Best Answer

I wrote Windows Console app to run the tests and specify TrustServerCertificate=True in connection. It loops, connects and gets data.

Thank you for the code - it helps! The encrypted connection information shouldn't have any bearing on this, though.

The latency is always 28 seconds, following failover.

This can happen for a variety of reasons. There are some subtle nuances that make it seem like it isn't actually up. Let's take a look.

string connetionString = "Trusted_Connection=yes;database=MyDataBse;server=fslistener,1433;Encrypt=True;TrustServerCertificate=True";

The connection string is not setting something called Connection Timeout which handles the way a connection is deemed to have timed out or failed. This is in seconds. The default is 15 seconds. Knowing this...

Assume SQL Server failed and we needed to fail over. The fastest this application would connect is the time it takes to failover. This is a combination of a few things:

  1. The cluster detecting an issue
  2. The cluster arbitrating of resources
  3. SQL Server recovering the databases

Let's also assume that the cluster is setup with defaults. This means it'll take a minimum of 5 seconds to notice that something is wrong. It may take longer, depending on how the node fails up to a maximum of the SQL Server healthcheck interval (default 30 seconds).

You can already see that finding the failure can take anywhere from 5-30 seconds, by default.

The next step is initiating the failure and arbitrating for resources. This should take a trivial amount of time, a few seconds at most. This is where the listener is moved and becomes responsive to new login attempts.

Additionally there is the time is takes for the databases to redo and become available - especially if there is a default database specified that is part of the AG (versus one that isn't, say master). This can take anywhere from near instantaneous to many minutes depending on use and hardware subsystems. Let's assume, for this, it only takes 3 seconds.

Adding this all up we have:

  1. 5 seconds to find the failure
  2. 2 seconds to arbitrate and come online
  3. 3 seconds for the database to be available

This totals out to 10 seconds. Now, remember the Time Out I was talking about earlier. If the application attempts to connect and receives a login timeout, that means we waited 15 seconds and didn't get a response. This doesn't mean we constantly tried for 15 seconds, it means we sent a request and patiently sat there, waiting. Adding in a single timeout wait of 15 seconds now brings our total up to 25 seconds. That's assuming a whole bunch of things.

Wrapping up

In the end, the actual failover time will be held in the SQL Server errorlog by finding the start of the failover and the database online messages. Additionally, extended events can be used to find the actual failover times.

The application may (most likely will) see a longer "downtime" due to the way it is coded. In this instance, a default timeout value of 15 seconds is most likely adding and additional overhead. The code also sleeps for 1 second between retries which makes a total of 16 seconds. The connection string has a default database which means it takes us even longer to connect as the database is required to be online for the connection to be established. It takes time to detect and arbitrate the failover, this needs to be taken into account.

More information: https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlconnection.connectiontimeout(v=vs.110).aspx