Sql-server – SQL SERVER 2014 FCI Cluster taking too long to failover

clustering, sql server, sql server 2014

I support a multi-instance, two-node failover cluster (active-active). Both instances run SQL Server 2014 on Windows Server 2008 R2. Each node has 3/4 TB of memory and 32 cores (64 with HT). SQL Server is configured with a max server memory of 350 GB per instance. Each instance hosts between 60 and 100 databases, with 3-5 TB of data files per instance.

I’m having some issues with failover times that I’m trying to resolve.
The issue appears to be centered around shutting down the SQL instance prior to the failover. When a manual failover is performed, we do the following to speed things up to this point.

1. Checkpoint all databases just prior to failover to get as many dirty pages written to disk prior to the beginning of the failover as possible. Failovers were much longer prior to adding this step.
2. Manual failover is initiated, SQL goes into SA only mode for 1-2 minutes and then the SQL instance stops.
3. It takes another 1-2 minutes to release the memory back to the OS before the failover occurs and the resource comes up on the other node.
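
For reference, the pre-failover checkpoint step could be scripted roughly as follows. This is only a minimal sketch, not the exact script in use here: it assumes the login can issue CHECKPOINT in every online, writable database, and the variable and cursor names are illustrative.

```sql
-- Sketch: issue a manual CHECKPOINT in every online, read-write database.
DECLARE @name sysname, @sql nvarchar(max);

DECLARE db_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE state_desc = N'ONLINE'
      AND is_read_only = 0;

OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @name;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- CHECKPOINT applies to the current database, so switch context first.
    SET @sql = N'USE ' + QUOTENAME(@name) + N'; CHECKPOINT;';
    EXEC sys.sp_executesql @sql;

    FETCH NEXT FROM db_cursor INTO @name;
END

CLOSE db_cursor;
DEALLOCATE db_cursor;
```

Pages dirtied after the loop finishes still have to be flushed during shutdown, so it is best run immediately before the failover is started.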

If the SQL instance has only limited data in memory the failover happens much more quickly.

We looked at I/O and CPU metrics and don’t see any significant issues.

I’m looking for resources and ideas to help reduce the time it takes for this failover to happen.
Thanks,
-Luke.

Best Answer

"The issue appears to be centered around shutting down the SQL instance prior to the failover."

You are correct, because that's exactly what is happening. You're shutting it down nicely on one node, moving resources, and starting it up on another node. The question actually doesn't have anything to do with clustering but with speeding up a clean SQL Server shutdown.

"When a manual failover is performed, we do the following to speed things up to this point."

When SQL Server cleanly shuts down, it flushes all of the buffers and asks internal systems to shut themselves down.
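
To get a feel for how much flushing a clean shutdown would have to do at a given moment, you could count the dirty pages currently in the buffer pool. A rough sketch, assuming VIEW SERVER STATE permission; note that scanning this DMV on an instance with hundreds of GB of buffer pool can itself take a while:

```sql
-- Sketch: approximate dirty (modified) buffer pool pages per database.
SELECT
    DB_NAME(database_id)     AS database_name,
    COUNT_BIG(*)             AS dirty_pages,
    COUNT_BIG(*) * 8 / 1024  AS dirty_mb        -- each page is 8 KB
FROM sys.dm_os_buffer_descriptors
WHERE is_modified = 1
  AND database_id <> 32767                      -- skip the resource database
GROUP BY database_id
ORDER BY dirty_pages DESC;
```

Running it before and after the manual checkpoints gives an indication of how much work is left for the shutdown itself.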

So, how do you make this faster? First, this is only really the case when you are cleanly shutting down - which most likely won't happen during a real failure. Secondly, you don't let SQL Server shut down - I would look into using Availability Groups and testing your manual failover times to compare and contrast the differences.

Additionally, whichever approach you use, indirect checkpoints are much more effective than traditional (automatic) checkpoints.
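
Indirect checkpoints are enabled per database by setting a target recovery time (available from SQL Server 2012 onward). A minimal sketch: `[YourDatabase]` is a placeholder name and 60 seconds is only an illustrative value, so test what suits your workload.

```sql
-- See which databases still use automatic checkpoints
-- (target_recovery_time_in_seconds = 0 means indirect checkpoints are off).
SELECT name, target_recovery_time_in_seconds
FROM sys.databases
ORDER BY name;

-- Enable indirect checkpoints for one database.
-- [YourDatabase] and the 60-second target are placeholders, not recommendations.
ALTER DATABASE [YourDatabase] SET TARGET_RECOVERY_TIME = 60 SECONDS;
```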

Related Solutions

Sql-server – SQL Server cluster failover

Failover clustering works to provide constant service in the event of a failure (defining a failure as an "abnormal termination of the previously active application, server, system, or network" - http://en.wikipedia.org/wiki/Failover). Manual manipulation (start, restart, pause, and stop) upon the service being clustered (SQL Server, File Services, etc.) does not qualify as an abnormal termination of the service.

If I recall, at least in Windows Server 2008, you can simulate a failover test case in the Failover Cluster Management console under Administrative Tools. Check this out for a list of testing methods: http://blogs.technet.com/b/vipulshah/archive/2009/06/17/failover-cluster-testing-methods.aspx

Sql-server – SQL Server cluster questions please

There are a few things that I can think of off the top of my head. You're running a multi-instance failover cluster, so in theory I'd expect each node to be sized such that at any given point in time it can handle the load of all three instances. Chances are that this is not the case, but maybe it is. Ideally, you'd also have a spare node that can handle failures, but that doesn't sound like it's the case here.

There are some configuration settings you can check to ensure that you've not set yourself up for failure. The first one I'd check is max server memory, by running:

sp_configure 'max server memory (MB)'

If your run value is 2147483647 then you've got it set to allow SQL Server to take as much memory as it thinks it needs. This is set per instance so when you have multiple instances trying to consume all available RAM you will get memory pressure.
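
If `sp_configure` reports that the option does not exist, 'show advanced options' is probably turned off. As a sketch, the same value can be read from `sys.configurations` without changing that, and a cap can be set on each instance afterwards. The 16384 MB figure below is purely illustrative, not a sizing recommendation:

```sql
-- Read the configured and running values without touching advanced options.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = N'max server memory (MB)';

-- Cap the instance. 16384 MB is a placeholder; size it so that all instances
-- that may land on one node fit in that node's RAM together.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', 16384;
RECONFIGURE;
```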

Having said that (read: actually, start here), you've not given us any other information about what you've done to discover why the application stops responding. Is it just the application that connects to the C node that chokes, or does the original application also not work? This could end up being something as simple as the application connection string is connecting to the IP/DNS name of the C node and not the VIP. If this is the case, then when C is no longer serving SQL Server then you're not going to be able to connect.

Step 1: Ensure the connection strings are actually connecting to the instance/VIP name and not the nodes.

Step 1.5: (Thanks to Thomas Stringer), make sure that you're giving the new instance enough time to actually recover the database. Connect to the instance via SSMS and see if your databases are in recovery.

Step 2: If Step 1 is correct, then get on the node that is running multiple instances and see what's going on. I'd recommend using PerfMon because "Task Manager is a dirty, filthy liar" and looking at metrics for the various subsystems starting with Memory, Network, CPU, and Disk IO. This answer contains much of what you'd need in order to check for resource pressure assuming you have connectivity to the instance and the databases are all fully recovered.
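
For Steps 1 and 1.5, a couple of quick checks once you can connect through the instance's virtual name. This is just a sketch and assumes the login can see all databases in `sys.databases`:

```sql
-- Step 1: confirm you are connected via the clustered (virtual) name,
-- and see which physical node is currently hosting the instance.
SELECT
    @@SERVERNAME                                  AS instance_name,
    SERVERPROPERTY('IsClustered')                 AS is_clustered,
    SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS current_node;

-- Step 1.5: list databases that are not yet online
-- (for example RECOVERING or RECOVERY_PENDING after a failover).
SELECT name, state_desc
FROM sys.databases
WHERE state_desc <> N'ONLINE'
ORDER BY name;
```

An empty result from the second query means every database has finished recovery.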
