Does it mean that former primary replica (A) will become primary replica automatically?
No, all that means is when your replica comes back into the picture, and when the Availability Group database gets back in a SYNCHRONIZED
state that it would be failover ready. That operation will not happen automatically. You indeed would have to either do this "failback" manually, or engineer a way to automate this (rather simple, let me know if you want to explore those options).
From a high-level view, your listing of steps is complete after step #3.
I think you are absolutely right on here:
The installer creates the distributor on the publisher, which I assume
will break replication after a failover and require generation of new
snapshots and reinitializion of subscriptions.
a good way to relate the replication to the availability group is:
The primary is SP1 and the secondary replica is SS2.
The distributor is on another server SD3. The replication is working fine.
We redirected the Original Publisher to the AG Listener Name using this script:
USE distribution;
GO
EXEC sys.sp_redirect_publisher
@original_publisher = 'MyPublisher',
@publisher_db = 'MyPublishedDB',
@redirected_publisher = 'MyAGListenerName';
You can find more details on this link:
Replication log reader not updated to new primary after availability group failover
In an AlwaysOn availability group a secondary database cannot be a publisher.
The failover of a distributor on an availability database is not supported.
I would suggest you use another server as a distributor, as you are probably already doing.
I normally use another server as a distributor and several distributor databases depending on the size and how busy your publications are.
I am not entirely sure about the push subscriptions but for the other stuff
you could have a look at this link:
Setting up Replication on a database that is part of an AlwaysOn Availability Group
SRV1: Original Publisher
SRV2: Publisher Replica
SRV3: Publisher Replica
SRV4: Distributor and Subscriber (You can choose a completely new
server to be the distributor as well, however do not have a
distributor on any of the publishers in this case as the failover of a
distributor is not supported in this case).
maybe this will not apply on your case, but just for reference:
Configure Replication for Always On Availability Groups (SQL Server)
Am I on the right track in saying the most reliable method of
"pausing" replication would be to use sp_MSstopdistribution_agent on
the publisher (for PUSH) and on subscriber (for PULL) before failover
to the (non-distributor) node, then do a sp_MSstartdistribution_agent
after we fail back again?
I would say in most cases yes, I used to use those procedures as you can see (scripts included) on this link:
How to restart the distributor agent of transactional replication?
Now I prefer to use the following method:
-- on the publisher
-- this gives you the distributor server name and distributor database name
sp_helpdistributor
-- now connect to the distributor server and database
-- get the job name
-- get the distributor job name for the publication you want
-- in this example my publication is called 'DEOrder'
use distribution
go
my database and publication are called ATOrder
in the example below.
-- the snapshot agent
select job_name=name, publisher_db, publication from distribution.dbo.mssnapshot_agents
where publisher_db = N'ATOrder'
-- the distributor jon transactional replication
select job_name=name, publisher_db, publication from distribution.dbo.MSlogreader_agents
where publisher_db = N'ATOrder'
-- if it is merge replication
select job_name=name, publisher_db, publication from MSmerge_agents
where publisher_db = N'ATOrder'
-- who are my publishers and who are my subscribers
select * from MSsubscriber_info
-- just checking dbs
select * from MSpublisher_databases
-- get the distributor job (log reader) for a specific publication
select job_name=name, publisher_db, publication from distribution.dbo.MSlogreader_agents
where publisher_db = N'ATOrder'
-- take note of the job name
--my_publication_server_name_and_instance2-ATOrder-25
-- check that the job is running at the moment - it must be!!
sp_runningjobs 'my_publication_server_name_and_instance2-ATOrder-25'
--check I am targeting the right job
exec msdb.dbo.sp_help_job @job_name = 'my_publication_server_name_and_instance2-ATOrder-25'
--stop the job
exec msdb.dbo.sp_stop_job @job_name = 'my_publication_server_name_and_instance2-ATOrder-25'
--start the job
exec msdb.dbo.sp_start_job @job_name = 'my_publication_server_name_and_instance2-ATOrder-25'
for more information, as to see if everything is going on alright, or who and what is getting a bit slow, you can check this link:
Replication Monitor Information using T-SQL
We are using SQL 2014 on both publisher and subscriber.
Lucky you
Best Answer
Thank you for the code - it helps! The encrypted connection information shouldn't have any bearing on this, though.
This can happen for a variety of reasons. There are some subtle nuances that make it seem like it isn't actually up. Let's take a look.
string connetionString = "Trusted_Connection=yes;database=MyDataBse;server=fslistener,1433;Encrypt=True;TrustServerCertificate=True";
The connection string is not setting something called
Connection Timeout
which handles the way a connection is deemed to have timed out or failed. This is in seconds. The default is 15 seconds. Knowing this...Assume SQL Server failed and we needed to fail over. The fastest this application would connect is the time it takes to failover. This is a combination of a few things:
Let's also assume that the cluster is setup with defaults. This means it'll take a minimum of 5 seconds to notice that something is wrong. It may take longer, depending on how the node fails up to a maximum of the SQL Server healthcheck interval (default 30 seconds).
You can already see that finding the failure can take anywhere from 5-30 seconds, by default.
The next step is initiating the failure and arbitrating for resources. This should take a trivial amount of time, a few seconds at most. This is where the listener is moved and becomes responsive to new login attempts.
Additionally there is the time is takes for the databases to redo and become available - especially if there is a default database specified that is part of the AG (versus one that isn't, say
master
). This can take anywhere from near instantaneous to many minutes depending on use and hardware subsystems. Let's assume, for this, it only takes 3 seconds.Adding this all up we have:
This totals out to 10 seconds. Now, remember the Time Out I was talking about earlier. If the application attempts to connect and receives a login timeout, that means we waited 15 seconds and didn't get a response. This doesn't mean we constantly tried for 15 seconds, it means we sent a request and patiently sat there, waiting. Adding in a single timeout wait of 15 seconds now brings our total up to 25 seconds. That's assuming a whole bunch of things.
Wrapping up
In the end, the actual failover time will be held in the SQL Server errorlog by finding the start of the failover and the database online messages. Additionally, extended events can be used to find the actual failover times.
The application may (most likely will) see a longer "downtime" due to the way it is coded. In this instance, a default timeout value of 15 seconds is most likely adding and additional overhead. The code also sleeps for 1 second between retries which makes a total of 16 seconds. The connection string has a default database which means it takes us even longer to connect as the database is required to be online for the connection to be established. It takes time to detect and arbitrate the failover, this needs to be taken into account.
More information: https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlconnection.connectiontimeout(v=vs.110).aspx