Sql-server – SQL Server cluster questions please

clusteringsql serversql-server-2008-r2

We have SQL Server 2008 R2 cluster active/active/active (a, b, c) and three instances run on this environment.

We did a test on one of the instances, and failover test.

  1. Manual failover from node c to b. After 3 minutes, the application cannot connect to SQL Server.
  2. Back to node c we are OK and application is happy
  3. Manual failover from node c to a. After 3 minutes, the application cannot connect to SQL Server.
  4. Back to node c we are OK and application is happy

Please give me some helps.

Best Answer

There are a few things that I can think of off the top of my head. You're running a multi-instance failover cluster so in theory I'd expect to see each node to be sized such that at any given point in time it can handle the load of all three instances. Chances are that this is not the case, but maybe it is. Ideally, you'd also have a spare node that can handle failures but that doesn't sound like it's the case here.

There are some configurations that you can check to ensure that you've not set yourself up for failure and the first one I'd check would be to run

sp_configure 'max server memory (MB)'

If your run value is 2147483647 then you've got it set to allow SQL Server to take as much memory as it thinks it needs. This is set per instance so when you have multiple instances trying to consume all available RAM you will get memory pressure.

Having said that (read: actually, start here), you've not given us any other information about what you've done to discover why the application stops responding. Is it just the application that connects to the C node that chokes, or does the original application also not work? This could end up being something as simple as the application connection string is connecting to the IP/DNS name of the C node and not the VIP. If this is the case, then when C is no longer serving SQL Server then you're not going to be able to connect.

Step 1: Ensure the connection strings are actually connecting to the instance/VIP name and not the nodes.

Step 1.5: (Thanks to Thomas Stringer), make sure that you're giving the new instance enough time to actually recover the database. Connect to the instance via SSMS and see if your databases are in recovery.

Step 2: If Step 1 is correct, then get on the node that is running multiple instances and see what's going on. I'd recommend using PerfMon because "Task Manager is a dirty, filthy liar" and looking at metrics for the various subsystems starting with Memory, Network, CPU, and Disk IO. This answer contains much of what you'd need in order to check for resource pressure assuming you have connectivity to the instance and the databases are all fully recovered.