After forced failover testing in async mode, do I need to rebuild my old primary database?
I'm not sure exactly what you mean by "rebuild" the database, but provided the databases are still in working condition, you shouldn't need to take any action like that.
What you're seeing after a forced failover is by design. A forced failover can land you on a replica that isn't completely caught up, or at the same point in time as the old primary replica. Because of that, data movement from the new primary replica to the secondary replica(s) is suspended, giving you a chance for manual intervention if you are now on a database that is "behind". The behavior you are seeing is a good thing.
This BOL reference explains it all:
After a forced failover, all secondary databases are suspended. This includes the former primary databases, after the former primary replica comes back online and discovers that it is now a secondary replica. You must manually resume each suspended database individually on each secondary replica.
When a secondary database is resumed, it initiates data synchronization with the corresponding primary database. The secondary database rolls back any log records that were never committed on the new primary database. Therefore, if you are concerned about possible data loss on the post-failover primary databases, you should attempt to create a database snapshot on the suspended databases on one of the synchronous-commit secondary databases.
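Per the quoted documentation, a database snapshot on a suspended secondary database lets you preserve a view of the unsent data before the rollback happens when you resume. A minimal sketch of creating one, assuming a database named YourDatabaseName with a single data file whose logical name is YourDatabaseName_Data (both names and the snapshot path are illustrative; you must list every data file of the real database):

```sql
-- Create a database snapshot on the suspended secondary database
-- so the pre-resume state can still be queried after rollback.
CREATE DATABASE YourDatabaseName_Snapshot
ON ( NAME = YourDatabaseName_Data,                      -- logical data file name
     FILENAME = 'D:\Snapshots\YourDatabaseName.ss' )    -- sparse snapshot file
AS SNAPSHOT OF YourDatabaseName;
```

You can then compare the snapshot against the new primary to identify rows that would otherwise be lost, and drop the snapshot once you're done.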
Please see this BOL reference on how to resume an AG database.
The T-SQL for this would be:
ALTER DATABASE YourDatabaseName
SET HADR RESUME;
NOTE/WARNING/DISCLAIMER: You really need to do some legwork to ensure that you are not causing data loss by resuming data movement. See above; it could be a huge problem. Data movement is suspended for this very reason: so you can manually make sure you recover as much data as possible first. Resuming data movement can be irreversible, so after a forced failover keep the words "potential data loss" at the forefront of your mind at all times.
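Before resuming anything, it can help to confirm which databases are actually suspended and why. A sketch of such a check using the sys.dm_hadr_database_replica_states DMV (run on the replica you're inspecting):

```sql
-- List availability databases whose data movement is suspended,
-- along with their synchronization state and the suspend reason.
SELECT DB_NAME(drs.database_id)           AS database_name,
       drs.synchronization_state_desc,
       drs.is_suspended,
       drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_suspended = 1;
```

After a forced failover you would expect the former primary databases to appear here once the old primary rejoins as a secondary.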
It looks like the answer is Windows Server 2012 R2. 2012 R2 only includes the file share witness vote when there is an even number of nodes. So if the DR site goes offline, the file share will be included in the quorum until such time as DR comes back online.
It also looks like, from testing, that if you gracefully shut down connections on the secondaries, the primary stays up. So to run from DR during a planned outage, we could fail over gracefully and then shut down the servers at the primary site. Because DR became the primary, it will remain up as the last node standing until quorum is restored (and synchronization is complete).
Best Answer
Do you have a test environment? If you don't, get a $200 free Azure credit and set your environment up there (at the same service pack level) and try it; it is one of the best ways to gain confidence and find edge cases. Nothing really beats testing a like-for-like environment except experience. Also check out the official support docs from Microsoft.
Per the docs, you'll want to check several items, including:
To determine the failover readiness of a secondary replica, query the is_failover_ready column in the sys.dm_hadr_database_replica_cluster_states dynamic management view, or look at the Failover Readiness column of the Always On Group Dashboard.
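The readiness check above can be sketched as a query joining that DMV to sys.availability_replicas so you can see which replica each row belongs to (column and DMV names are as documented; the join on replica_id is the assumed relationship):

```sql
-- Show, per replica and per database, whether a failover
-- without data loss is currently possible.
SELECT ar.replica_server_name,
       dcs.database_name,
       dcs.is_failover_ready          -- 1 = synchronized, safe to fail over
FROM sys.dm_hadr_database_replica_cluster_states AS dcs
JOIN sys.availability_replicas AS ar
  ON dcs.replica_id = ar.replica_id;
```

If is_failover_ready is 0 for the target replica, a manual failover to it would be a forced failover with potential data loss.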
To answer the question, though: Always On failovers are very fast, as the service is already running on both instances and the listener simply points to the new host. Ensure that clients connect to the listener rather than to the IP of the machine.
Thus, if tested and running properly in your environment, it is much faster than FCIs (Failover Cluster Instances) during busy hours. Note that Always On fails over per database, not per server, so you will want to ensure that jobs, users, and permissions/SIDs are set up correctly on the new primary replica; and if you need to keep databases in sync with each other, make sure you handle that somehow before failing one of them over and not the other. Also, if you use MSDTC for cross-database transactions, be very careful: it can cause irreparable data corruption in many cases where you use cross-database transactions within the same instance.
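One concrete piece of the "users and permissions/SIDs" check is finding orphaned users after a failover. A sketch of such a query (run it in the user database on the new primary; the filters are a common convention, not an official script) matches database users to server logins by SID:

```sql
-- Find database users whose SID has no matching login on this
-- instance (orphaned users), excluding contained/no-login users.
SELECT dp.name AS orphaned_user,
       dp.type_desc
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp
       ON dp.sid = sp.sid
WHERE sp.sid IS NULL
  AND dp.type IN ('S', 'U')                       -- SQL and Windows users
  AND dp.authentication_type_desc <> 'NONE';      -- skip users without login
```

Creating the logins on every replica with the same SID as on the original primary avoids this problem in the first place.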
Perhaps Sean will be able to give you more of the issues you might face if he sees your thread.