SQL Server Cluster – Understanding Maximum Failures in Specified Period Setting

clustering sql-server-2012 windows-server

The failover properties of the SQL Server role on my two-node SQL cluster are as follows:

Maximum Failures in the Specified Period: 2

Time period: 6 (hrs)

I would expect that forcing failover (either by moving the services manually or rebooting the owner node) more than twice would cause the resources to stay 'Offline', but this isn't the case. I can fail over a seemingly unlimited number of times with no impact.

What is the purpose of that property and how do I simulate things correctly to test behaviour?

Best Answer

I would expect that forcing failover (either by moving the services manually or rebooting the owner node) more than twice would cause the resources to stay 'Offline', but this isn't the case.

Correct, that isn't the case. Moving the services manually isn't a failure: you've told the cluster to go ahead and change the owner. A reboot isn't treated as a failure either, as long as it was controlled, for example running the shutdown -r command or clicking Start -> Restart. You're telling the server, "Hey, have a nice controlled reboot, no worries."

What is the purpose of that property and how do I simulate things correctly to test behaviour?

AFAIK the purpose of the property is to stop the role from ping-ponging between servers indefinitely, potentially causing more issues. If you exhaust the number of retries, the implicit assumption is that each node has already tried to host the role without success, so why keep flopping around and spewing more logs for no reason?
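To make the counting concrete, here is a toy model in Python. It is only a sketch of the idea, not how the cluster service is actually implemented. The class name RoleFailoverPolicy and its methods are made up for illustration, and the trailing-window bookkeeping is an assumption on my part. The point it shows is that planned moves never touch the counter, while unplanned failures do, and only exceeding the maximum within the period leaves the role failed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class RoleFailoverPolicy:
    """Toy model of a clustered role's failover policy (hypothetical, for illustration)."""
    max_failures: int = 2                    # "Maximum Failures in the Specified Period"
    period: timedelta = timedelta(hours=6)   # "Time period"
    failures: list = field(default_factory=list)
    state: str = "Online"

    def planned_move(self, when: datetime) -> str:
        # Manual move or clean reboot: the owner changes, nothing is counted.
        return self.state

    def unplanned_failure(self, when: datetime) -> str:
        # Assumption: only failures inside the trailing window are counted.
        cutoff = when - self.period
        self.failures = [t for t in self.failures if t > cutoff]
        self.failures.append(when)
        if len(self.failures) > self.max_failures:
            # Threshold exceeded: stop failing over and leave the role down.
            self.state = "Failed"
        return self.state


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    policy = RoleFailoverPolicy()

    # Ten manual moves in a row: still online, counter untouched.
    for i in range(10):
        policy.planned_move(start + timedelta(minutes=i))
    print(policy.state, len(policy.failures))

    # Three real failures within six hours: the third one exceeds the limit.
    for hours in (1, 2, 3):
        print(policy.unplanned_failure(start + timedelta(hours=hours)))
```

In this model the manual moves from the question leave the role online forever, which matches what you observed. Only the third genuine failure inside the six-hour window flips the state.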

To simulate this, cause an actual failure: force an unclean shutdown of a machine (blue screen it on purpose with something like NotMyFault), rip the disks out from under it, or do anything else that causes an ACTUAL failure.