Windows – Recurring failovers on a production server

clustering, windows

My main production server has recently begun failing over at least a few times a day. The strange part is that the quorum doesn't move to the other node.

I ran the cluster validation report and got a lot of information I don't understand. The server is an HP ProLiant BL680c G5. Here is some of the report output I'm wondering about:

Degraded:
HpCISSs
HP MPIO DSM for EVA4x00/6x00/8x00 family of Disk Arrays
Link-Layer Topology Discovery Mapper I/O Driver
Mount Point Manager

Those are just a few. I'm not so worried about "errors", but "degraded" seems to imply that it should be running but isn't doing so well.
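In case it helps anyone reproduce this, here's how I re-ran just the storage portion of validation (a sketch; it assumes Windows Server 2008 R2, where the FailoverClusters PowerShell module exists, and note that the storage tests can take disks offline, so don't point them at LUNs that are serving production):

    Import-Module FailoverClusters

    # Show the names of all available validation tests/categories
    Test-Cluster -List

    # Re-run only the storage category of the validation report.
    # Caution: storage tests can briefly take cluster disks offline.
    Test-Cluster -Include "Storage"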

System details:
SQL Server - 10.50.4000.0 (SQL Server 2008 R2 SP2)
Windows - Windows NT, 64-bit

It really is going to be awful if it IS an MPIO issue. That has happened many times, and the DBA team has been accused of modifying the settings! I'm the team lead, and even I have no idea what half of this means (I know what MPIO is, and I recognize all the SAN stuff, but troubleshooting? Nah.)
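For what it's worth, here is what I'm running to at least see the MPIO path state (a sketch; it assumes the Windows Multipath I/O feature is installed so mpclaim.exe is present, which it should be if the HP DSM is in use):

    # List the disks currently claimed by MPIO and their load-balance policies
    mpclaim.exe -s -d

    # Show the individual path states (active/standby/failed) for MPIO disk 0;
    # repeat for each disk number from the previous listing
    mpclaim.exe -s -d 0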

Interesting new information – right before this began happening, we updated the Firewall Service Module and rebooted the core switches.

I'm starting to think some settings were reset to defaults that aren't correct for our environment?
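These are the values I'm planning to compare across the two nodes (a sketch; the disk class key is standard, and on 2008/2008 R2 the in-box MPIO timers live under the mpio service key, though the HP DSM may keep its own settings elsewhere):

    # Disk class timeout in seconds; HP publishes a recommended value for
    # EVA arrays, so compare this against their documentation and both nodes
    Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Disk' -Name TimeOutValue

    # In-box MPIO timers (PDORemovePeriod, PathVerifyEnabled, and so on)
    Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\mpio\Parameters'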

Logged during the failover:

Cluster Agent: The cluster resource FileServer-(server)(Cluster Disk 1) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB]
Cluster Agent: The cluster resource SQL Server Agent has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]
Cluster Agent: The cluster resource SQL Server has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]
Cluster Agent: The cluster resource Analysis Services has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]
Cluster Agent: The cluster resource FileServer-(servername)(Cluster Disk 4) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB]

It is really odd because there aren't a lot of error messages; the only real info I have is from the validation report. Disks 1-4 always fail, though not always logged in the same order, and then the quorum just stays on whichever node it is on.
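Since the event logs are so sparse, I'm going to grab the detailed cluster log right after the next failover (a sketch; Get-ClusterLog again assumes the 2008 R2 FailoverClusters module):

    Import-Module FailoverClusters

    # Dump the last 30 minutes of the detailed cluster log from every node
    # into C:\Temp (one Cluster.log file per node)
    Get-ClusterLog -TimeSpan 30 -Destination C:\Temp

    # Recent failover-clustering events from the System event log
    Get-WinEvent -LogName System -MaxEvents 200 |
        Where-Object { $_.ProviderName -like '*FailoverClustering*' } |
        Format-Table TimeCreated, Id, Message -Wrap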

The network guy thinks that rebooting the core switch may have affected the preferred routes for the NICs. He's going to clean things up a bit this weekend (removing the file shares and recreating them) and we'll see where we are. Update: the removing/recreating didn't work; it failed over again yesterday evening.
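To double-check the network side after the switch reboot, this is what I'm using to confirm the cluster networks and their roles still look sane (a sketch, same 2008 R2 module assumption as above):

    Import-Module FailoverClusters

    # Cluster networks and their state/role: 3 = cluster and client traffic,
    # 1 = cluster (heartbeat) only, 0 = excluded from cluster use
    Get-ClusterNetwork | Format-Table Name, State, Role, Address

    # Which NIC on which node backs each cluster network
    Get-ClusterNetworkInterface | Format-Table Node, Network, Name, State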

We are using Fibre Channel to connect the servers to the storage array. We just patched Windows (using Shavlik) and now we have the same issue again. I'm starting to wonder if some default setting in MPIO keeps coming back.

Best Answer

I'm not familiar with this particular error, but I've encountered situations where a two-node cluster had repeated failovers due to MPIO issues with the SAN LUNs. More often than not, it was resolved by updating the HBA drivers.
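Before chasing new drivers from HP, it's worth confirming exactly what's installed on each node. A sketch using the generic WMI driver class; the -match pattern is a guess, so adjust it to whatever your HBA and HP driver names actually are:

    # List storage/HBA-related drivers with their versions; compare the
    # output between the two nodes and against HP's current releases
    Get-WmiObject Win32_PnPSignedDriver |
        Where-Object { $_.DeviceName -match 'HBA|Fibre|CISS' } |
        Sort-Object DeviceName |
        Format-Table DeviceName, DriverVersion, DriverDate -AutoSize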

One other thing to check is that the disk dependencies are set properly. The SQL Server resource should depend on all the disks that host the database files and the backups, as well as the disk whose drive letter acts as the mount point host. I've run into a few hosts where a missing disk dependency caused a disk to go offline before SQL Server could close out the database files.
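You can inspect and correct the dependency tree with the cluster cmdlets on 2008 R2 (a sketch; the resource and disk names below are placeholders for yours):

    Import-Module FailoverClusters

    # Show what the SQL Server resource currently depends on
    Get-ClusterResourceDependency -Resource "SQL Server"

    # Careful: Set-ClusterResourceDependency replaces the entire expression,
    # so keep existing dependencies (like the SQL network name) in the new one
    Set-ClusterResourceDependency -Resource "SQL Server" `
        -Dependency "[SQL Network Name (servername)] and [Cluster Disk 1] and [Cluster Disk 2] and [Cluster Disk 3]"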