Sql-server – SQL SERVER 2014 FCI Cluster taking too long to failover

clusteringsql serversql server 2014

I support a multi-instance, two node failover cluster (active-active). Both instances are running SQL 2014, on Windows Server 2008R2. The nodes each have 3/4 TB of memory and 32 cores (64 with HT). SQL is configured with a Max memory of 350GB for each instance. There are between 60 and 100 databases per instance with 3-5 TB of datafiles in per instance.

I’m having some issues with failover times that I’m trying to resolve.
The issue appears to be centered around shutting down the SQL instance prior to the failover. When a manual failover is performed, we do the following to speed things up to this point.

  1. Checkpoint all databases just prior to failover to get as many
    dirty pages written to disk prior to the beginning of the failover
    as possible. Failovers were much longer prior to adding this step

  2. Manual failover is initiated, SQL goes into SA only mode for 1-2
    minutes and then the SQL instance stops.

  3. It takes another 1-2 minutes to release the memory back to the OS before the failover occurs and the resource comes up on the other node.

If the SQL instance has only limited data in memory the failover happens much more quickly.

We looked at I/O and CPU metrics and don’t see any significant issues.

I’m looking for resources and ideas to help reduce the time it takes for this failover to happen.
Thanks,
-Luke.

Best Answer

The issue appears to be centered around shutting down the SQL instance prior to the failover.

You are correct, because that's exactly what is happening. You're shutting it down nicely on one node, moving resources, and starting it up on another node. The question actually doesn't have anything to do with clustering but with speeding up a clean SQL Server shutdown.

When a manual failover is performed, we do the following to speed things up to this point.

When SQL Server cleanly shuts down, it flushes all of the buffers and asks internal systems to shut themselves down.

So, how do you make this faster? First, this is only really the case when you are cleanly shutting down - which most likely won't happen during a real failure. Secondly, you don't let SQL Server shut down - I would look into using Availability Groups and testing your manual failover times to compare and contrast the differences.

Additionally, no matter what you use, indirect checkpoints are much more effective than traditional.