AlwaysOn Auto failover. Make it faster.

availability-groupsperformance

I cause a 2016 Server Standard Database to AlwaysOn Auto failover (its a test DB single table), I have to wait up to 60 seconds to connect again via the listenser to the standby.

No data is lost. But 100% up time is not.

What can I tweak to speed it up?

Best Answer

I wrote Windows Console app to run the tests and specify TrustServerCertificate=True in connection. It loops, connects and gets data.

Thank you for the code - it helps! The encrypted connection information shouldn't have any bearing on this, though.

The latency is always 28 seconds, following failover.

This can happen for a variety of reasons. There are some subtle nuances that make it seem like it isn't actually up. Let's take a look.

string connetionString = "Trusted_Connection=yes;database=MyDataBse;server=fslistener,1433;Encrypt=True;TrustServerCertificate=True";

The connection string is not setting something called Connection Timeout which handles the way a connection is deemed to have timed out or failed. This is in seconds. The default is 15 seconds. Knowing this...

Assume SQL Server failed and we needed to fail over. The fastest this application would connect is the time it takes to failover. This is a combination of a few things:

The cluster detecting an issue
The cluster arbitrating of resources
SQL Server recovering the databases

Let's also assume that the cluster is setup with defaults. This means it'll take a minimum of 5 seconds to notice that something is wrong. It may take longer, depending on how the node fails up to a maximum of the SQL Server healthcheck interval (default 30 seconds).

You can already see that finding the failure can take anywhere from 5-30 seconds, by default.

The next step is initiating the failure and arbitrating for resources. This should take a trivial amount of time, a few seconds at most. This is where the listener is moved and becomes responsive to new login attempts.

Additionally there is the time is takes for the databases to redo and become available - especially if there is a default database specified that is part of the AG (versus one that isn't, say master). This can take anywhere from near instantaneous to many minutes depending on use and hardware subsystems. Let's assume, for this, it only takes 3 seconds.

Adding this all up we have:

5 seconds to find the failure
2 seconds to arbitrate and come online
3 seconds for the database to be available

This totals out to 10 seconds. Now, remember the Time Out I was talking about earlier. If the application attempts to connect and receives a login timeout, that means we waited 15 seconds and didn't get a response. This doesn't mean we constantly tried for 15 seconds, it means we sent a request and patiently sat there, waiting. Adding in a single timeout wait of 15 seconds now brings our total up to 25 seconds. That's assuming a whole bunch of things.

Wrapping up

In the end, the actual failover time will be held in the SQL Server errorlog by finding the start of the failover and the database online messages. Additionally, extended events can be used to find the actual failover times.

The application may (most likely will) see a longer "downtime" due to the way it is coded. In this instance, a default timeout value of 15 seconds is most likely adding and additional overhead. The code also sleeps for 1 second between retries which makes a total of 16 seconds. The connection string has a default database which means it takes us even longer to connect as the database is required to be online for the connection to be established. It takes time to detect and arbitrate the failover, this needs to be taken into account.

More information: https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlconnection.connectiontimeout(v=vs.110).aspx

Related Solutions

SQL Server AlwaysOn – How Automatic Failover Works

Does it mean that former primary replica (A) will become primary replica automatically?

No, all that means is when your replica comes back into the picture, and when the Availability Group database gets back in a SYNCHRONIZED state that it would be failover ready. That operation will not happen automatically. You indeed would have to either do this "failback" manually, or engineer a way to automate this (rather simple, let me know if you want to explore those options).

From a high-level view, your listing of steps is complete after step #3.

SQL Server – Pausing AlwaysOn Availability Group Replication for Failover

I think you are absolutely right on here:

The installer creates the distributor on the publisher, which I assume will break replication after a failover and require generation of new snapshots and reinitializion of subscriptions.

a good way to relate the replication to the availability group is:

The primary is SP1 and the secondary replica is SS2. The distributor is on another server SD3. The replication is working fine. We redirected the Original Publisher to the AG Listener Name using this script:

USE distribution;
GO  
EXEC sys.sp_redirect_publisher   
    @original_publisher = 'MyPublisher',  
    @publisher_db = 'MyPublishedDB',  
    @redirected_publisher = 'MyAGListenerName';

You can find more details on this link:

Replication log reader not updated to new primary after availability group failover

In an AlwaysOn availability group a secondary database cannot be a publisher.

The failover of a distributor on an availability database is not supported.

I would suggest you use another server as a distributor, as you are probably already doing.

I normally use another server as a distributor and several distributor databases depending on the size and how busy your publications are.

I am not entirely sure about the push subscriptions but for the other stuff you could have a look at this link:

Setting up Replication on a database that is part of an AlwaysOn Availability Group

SRV1: Original Publisher

SRV2: Publisher Replica

SRV3: Publisher Replica

SRV4: Distributor and Subscriber (You can choose a completely new server to be the distributor as well, however do not have a distributor on any of the publishers in this case as the failover of a distributor is not supported in this case).

maybe this will not apply on your case, but just for reference:

Configure Replication for Always On Availability Groups (SQL Server)

Am I on the right track in saying the most reliable method of "pausing" replication would be to use sp_MSstopdistribution_agent on the publisher (for PUSH) and on subscriber (for PULL) before failover to the (non-distributor) node, then do a sp_MSstartdistribution_agent after we fail back again?

I would say in most cases yes, I used to use those procedures as you can see (scripts included) on this link:

How to restart the distributor agent of transactional replication?

Now I prefer to use the following method:

-- on the publisher
-- this gives you the distributor server name and distributor database name

sp_helpdistributor


-- now connect to the distributor server and database 
-- get the job name 

-- get the distributor job name for the publication you want
-- in this example my publication is called 'DEOrder'
use distribution
go

my database and publication are called ATOrder in the example below.

-- the snapshot agent
select  job_name=name, publisher_db, publication from  distribution.dbo.mssnapshot_agents
where publisher_db = N'ATOrder'

-- the distributor jon transactional replication
select job_name=name, publisher_db, publication from  distribution.dbo.MSlogreader_agents
where publisher_db = N'ATOrder'

-- if it is merge replication
select  job_name=name, publisher_db, publication  from MSmerge_agents
where publisher_db = N'ATOrder'

-- who are my publishers and who are my subscribers
select * from MSsubscriber_info

-- just checking dbs
select * from MSpublisher_databases



-- get the distributor job (log reader) for a specific publication
select job_name=name, publisher_db, publication from  distribution.dbo.MSlogreader_agents
where publisher_db = N'ATOrder'

-- take note of the job name
--my_publication_server_name_and_instance2-ATOrder-25

-- check that the job is running at the moment - it must be!!
sp_runningjobs 'my_publication_server_name_and_instance2-ATOrder-25'


--check I am targeting the right job
exec msdb.dbo.sp_help_job @job_name = 'my_publication_server_name_and_instance2-ATOrder-25'

--stop the job
exec msdb.dbo.sp_stop_job @job_name = 'my_publication_server_name_and_instance2-ATOrder-25'

--start the job
exec msdb.dbo.sp_start_job @job_name = 'my_publication_server_name_and_instance2-ATOrder-25'

for more information, as to see if everything is going on alright, or who and what is getting a bit slow, you can check this link:

Replication Monitor Information using T-SQL

We are using SQL 2014 on both publisher and subscriber.

Lucky you

Best Answer

Related Solutions

SQL Server AlwaysOn – How Automatic Failover Works

SQL Server – Pausing AlwaysOn Availability Group Replication for Failover

Related Question