SQL Server takes an hour to fail over to secondary node

availability-groups, clustering, failover, sql-server, sql-server-2008

I am at a new job. We have a Windows Server 2008 R2 (SP1) WSFC that hosts a SQL Server 2008 FCI.

Each node has its own storage; the nodes are 32 core with 32 GB of memory (don't ask me why it is configured this way). The storage at each node is a read-only copy when the node is secondary. Until today I was told that it was shared storage. I spoke with the sysadmin about it because the quorum has a file share witness, not a disk witness. He then explained that it does have its own storage at each site, but it is not active on the secondary node. Blocks of data are replicated between the primary and secondary nodes.

What's happening is that the host is failing over (unclear as to why); the most likely culprit is a heartbeat timeout.

When the nodes fail over, SQL Server logs that it was sent a shutdown command.

Sometimes the second (SQL) node comes up within a minute or so.

Other times it takes an hour for SQL Server to come online on the secondary node. On the OS side the failover is quick; it's just SQL Server that takes an hour.

My thought is that something was running.

I have not found any logs which indicate anything like this though.

In this reactive state, is there anything I can look at within SQL Server, such as a system database log or table? (The error logs have nothing between the shutdown command being issued and the other instance coming online.)

I've not had this issue before and I have been scouring Google to no avail.

Best Answer

First, I want to address a few things.

Each node has its own storage; the nodes are 32 core with 32 GB of memory (don't ask me why it is configured this way). The storage at each node is a read-only copy when the node is secondary. Until today I was told that it was shared storage. ... He then explained that it does have its own storage at each site, but it is not active on the secondary node. Blocks of data are replicated between the primary and secondary nodes.

This is block-replicated storage, and we have no idea how or what is doing it. It could be done through software, through hardware, or a combination of both.

I spoke with the sysadmin about it because the quorum has a file share witness, not a disk witness.

There is nothing wrong with having a file share witness (FSW). It's also the most common witness type.
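
If you want to verify the quorum configuration for yourself, here is a minimal sketch using the FailoverClusters PowerShell module, assuming you run it on one of the cluster nodes:

    # Show the quorum mode (e.g. NodeAndFileShareMajority) and the witness
    # resource currently configured for the cluster.
    Import-Module FailoverClusters
    Get-ClusterQuorum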



Now to methodology:

What's happening is that the host is failing over (unclear as to why); the most likely culprit is a heartbeat timeout.

The cluster logs will tell you why. Generate and go through the cluster log to understand why the failover took place. If, for example, the root cause was failed heartbeats, then you would need extra packet logging on each node to understand why the heartbeats were failing.
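
For reference, a minimal sketch of generating the cluster log with the FailoverClusters PowerShell module; the destination folder and the time window are assumptions to adjust:

    # Generate cluster.log from every node, covering the last two hours,
    # and drop the files into a working folder for analysis.
    Import-Module FailoverClusters
    New-Item -ItemType Directory -Path C:\ClusterLogs -Force | Out-Null
    Get-ClusterLog -Destination C:\ClusterLogs -TimeSpan 120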

When the nodes fail over, SQL Server logs that it was sent a shutdown command.

Yes, so you know the time this was logged in the SQL Server errorlog. Thus, the originating issue had to happen before this in the cluster log. This should start your timeline of events, which will ultimately lead to the overall time it took for SQL Server to come up.
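
To anchor that first timestamp, you can search the errorlog files directly. A rough sketch in PowerShell, where the log directory and the message pattern are assumptions to adjust for your instance:

    # Search the SQL Server errorlog files for the shutdown message to anchor
    # the start of the timeline. Adjust the log directory for your instance
    # and the pattern to match the exact shutdown message you see.
    $logDir = 'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log'
    Select-String -Path (Join-Path $logDir 'ERRORLOG*') -Pattern 'terminating' |
        Select-Object Filename, LineNumber, Line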

Sometimes the second (SQL) node comes up within a minute or so.

We need you to define "comes up".

It's ambiguous whether you mean that the cluster attempted to start the SQL Server service, or that the database you care about is online and accessible for user queries.

In either case: if it's the cluster bringing the SQL Server resource (the service) online, then that is all in the cluster log of the node that now owns the resource group. It should be trivial to find the online call for the resource, as you can quickly search for its name in the log. If it's the database becoming accessible, then you can look at the SQL Server errorlog and see how long the startup and recovery messages took. Both should be trivial to find and can be added to your timeline of events and times.
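
Both searches can be scripted. A rough sketch, reusing the paths from the earlier examples; the resource name and message patterns are assumptions to adjust for your environment:

    # 1) Find online calls for the SQL Server resource in the cluster log.
    Select-String -Path 'C:\ClusterLogs\*.log' -Pattern 'SQL Server.*[Oo]nline' |
        Select-Object Filename, Line

    # 2) Find startup and recovery messages in the SQL Server errorlog.
    $logDir = 'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log'
    Select-String -Path (Join-Path $logDir 'ERRORLOG*') `
        -Pattern 'Starting up database|Recovery of database|Recovery completed' |
        Select-Object Filename, Line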

Other times it takes an hour for SQL Server to come online on the secondary node. On the OS side the failover is quick; it's just SQL Server that takes an hour.

Again, I'm unsure what you mean by the "OS side". Is the call to online the resource successful and SQL Server starts up, but the database isn't accessible? Or does the resource group move over, but the resources aren't brought online in a timely manner?

I have not found any logs which indicate anything like this though.

AFAIK there is no log that says why a database took a long time to start up. You need to build a timeline of events, extracting the data from multiple sources. Then you can piece together what happened and roughly how much time each step took, focus in on the long steps, and break them down into their individual pieces.

For example, when I do an RCA for these events I start with:

  1. When did the originating issue occur and what was it?
  2. When was SQL Server affected and how did it affect it? Let's assume, there was a failover.
  3. When did the failover occur on the previous owning node?
  4. When did the resource group move over to the next owning node (if not the last one to own it)?
  5. When was the call to online the SQL Server resource first successful and were there any unsuccessful attempts before that?
  6. When did SQL Server log it started the service?
  7. What order were the databases started and at what time?
  8. When did each database finish the recovery phases and how long did that take?
  9. When was the first successful user connection?
  10. Were there any other notable errors, warnings, or informational messages in the logs?

Once you have that overall timeline you can find the pieces that took the longest and investigate further from there.
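
As a closing sketch (same assumed paths and patterns as above), one way to pull the pieces into a single file while you build that timeline:

    # Collect the interesting lines from both sources into one file, tagged
    # by origin, to review side by side. NOTE: cluster log timestamps are
    # typically UTC while the errorlog uses local server time, so align the
    # two before comparing.
    $logDir   = 'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log'
    $timeline = @()
    $timeline += Select-String -Path 'C:\ClusterLogs\*.log' -Pattern 'SQL Server' |
        ForEach-Object { "cluster  : $($_.Line)" }
    $timeline += Select-String -Path (Join-Path $logDir 'ERRORLOG*') `
        -Pattern 'Starting up database|Recovery of database|Recovery completed' |
        ForEach-Object { "errorlog : $($_.Line)" }
    $timeline | Set-Content C:\ClusterLogs\timeline.txt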