Sql-server – Failover scenario with principal “hard down”

failovermirroringsql serversql-server-2005

My company has a failover pair of SQL Server 2005 instances, which provide database availability for a UL-licensed alarm/call center, where uptime and availability are critical for life safety.

Here's an example line from the connectionStrings of one of the business-critical applications (sanitized for our protection):

<add name="MyCompany.MyApp.Properties.Settings.MyDbConnectionString"
 connectionString="Data Source=Principal-DB;Failover Partner=Mirror-DB;
 Initial Catalog=MyDb; Integrated Security=True; Persist Security Info=True;"
 providerName="System.Data.SqlClient" />

About half an hour ago, this was the scenario as we believe it to be:

  1. DBA reports need for hardware maintenance on Principal-DB server to all users at a specified time, and recommends they close and restart client applications if they experience database connection issues.
  2. DBA fails over from Principal-DB to Mirror-DB. Client applications are unaffected.
  3. DBA powers down Principal-DB server.
  4. Some client applications begin to timeout on pooled connections (as would be expected). Users experiencing these errors close and restart as recommended by the DBA.
  5. Those client applications now fail to start altogether, due to a failure to receive a response from the database.
  6. Hilarity ensues.

Our theory was that, because Principal-DB was not in a state to respond to connection requests at all, when it would normally refuse such connections quickly if it were in the restoring state, the client applications ended up waiting for the entire allotted connection timeout (default, 20 seconds) for the primary server to respond, then returning the timeout error without ever attempting to connect to the listed failover partner.

The quick fix was to push an update to the client application containing an App.config that swapped the Data Source and Failover Partner instances, so Mirror-DB was now the server to which the app attempted to connect first. When we fail back to Principal-DB, we will have to undo this change with another application update.

I need a more permanent fix. This was not our expected behavior for a failover pair, and it cannot be allowed to happen again. There has to be a way to configure the client application so that it will correctly try to connect to the failover partner in this circumstance before returning an error.

Best Answer

I found the problem.

SQL Native Client and the .NET SQL Data Provider should seamlessly connect to a failover partner even if the primary is down; however, the default configuration of both providers is to try both TCP/IP and Named Pipes.

This configuration is fine, and works appropriately, when the servers are still both available but the primary has failed over to the backup, because the network-layer connection will succeed, but the application-level connection will be promptly refused causing the client to try the mirror. If the primary is unresponsive, however, the client will waste the entire default 20-second timeout period just waiting for the primary server to respond, first over TCP/IP, then over Named Pipes (which by itself has a default 20- to 30-second timeout period at the network level), and so in this situation it will return an error without ever having tried the backup mirror.

The solution, according to this TechNet article, is to force the client, one way or another, to connect using only TCP/IP, which will fail fast enough that the client has ample time to try both partners. You can manually configure the data providers to do this, or you can override their default settings case-by-case, by specifying the proper protocol to use in the connection string. The parameter to include that forces TCP/IP is Network=dbmssocn;. On top of that, the SQL client has a "three strikes" approach before it gives up and tries the failover mirror, so the Connection Timeout should be set long enough that it will get through this process and try the failover partner at least once. A timeout of 30 seconds should be adequate.

We tried this fix when our primary DB server was powered down again, and it does work. Now we're in the process of updating the config files of all our in-house apps (we have at least a half-dozen desktop apps and intranet sites that target this failover pair; it's a well-loved stack).