SQL Server 2016 – Database Level Health Check Did Not Trigger Failover

Tags: availability-groups, failover, high-availability, sql-server, sql-server-2016

I'm testing the new database level health detection option in SQL Server 2016 Enterprise Edition (SP2 CU7), and it does not seem to be working as expected. We have a two-node setup with synchronous commit and automatic failover on both nodes. The Database Level Health Detection option is checked. On the primary node, I took offline a drive that contained one of the data files of a database in the AG. I then ran a SELECT * against a table that reads from the missing disk and got the expected error 823, which was written to the error log. I ran the query a few times, and the error log recorded the 823 multiple times.

The availability group did NOT fail over as it was supposed to when this happened. I waited about 3 minutes to see whether a failover would occur, and it never did. How can I find out how often the database level health check routine is set to run? I understand that it needs to see the issue in 4 consecutive runs, according to this article:
enhanced database level failover

I checked the health check timeout value in the AG, and it was 30 seconds.

I also reviewed the failure condition level on the server and it is set to On CriticalServerErrors, but as I understand it, this setting is completely independent of the database level health check, and either of them should be able to trigger a failover on their own. Is this correct?
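For reference, the three settings mentioned above (health check timeout, failure condition level, and the database level health detection flag) can be read back from the `sys.availability_groups` catalog view. This is a minimal sketch run on the primary replica. It only shows the configured values and does not reveal how often the health check routine runs.

```sql
-- Read back the AG-level failover settings discussed above.
SELECT
    ag.name AS ag_name,
    ag.failure_condition_level,   -- 1 to 5; 3 corresponds to OnCriticalServerErrors
    ag.health_check_timeout,      -- in milliseconds (30000 = 30 seconds)
    ag.db_failover                -- 1 = database level health detection is enabled
FROM sys.availability_groups AS ag;
```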

The only thing I can think of that might be preventing this is the pending timeout in the WSFC manager. It has a value of 3 minutes before it will take the cluster resource offline.

Any idea where else I should be looking for why this did not fail over?

Best Answer

On the primary node, I took a drive offline that contained one of the data files of the DB that is in the AG. I ran a select * from a table which read from the missing disk [...]

Since you're on SQL Server 2016, the database level health check only verifies that the database is online (which taking a secondary file offline won't change) and that the transaction log can be written to. Both of these were still true during your test, so the health check passed and no failover was triggered. That's how it works in 2016.
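One way to check this during such a test is to look at the state the engine reports for the database while the drive is offline. The query below is a sketch, and `YourAgDatabase` is a placeholder name that does not come from the original post. If the explanation above holds, the database would be expected to still report as ONLINE even though reads against the missing file fail with error 823.

```sql
-- 'YourAgDatabase' is a hypothetical placeholder; substitute the AG database name.
SELECT
    d.name       AS database_name,
    d.state_desc AS database_state,       -- the database-level state the engine reports
    drs.synchronization_state_desc,
    drs.synchronization_health_desc
FROM sys.databases AS d
JOIN sys.dm_hadr_database_replica_states AS drs
    ON drs.database_id = d.database_id
WHERE drs.is_local = 1                    -- only the copy hosted on this instance
  AND d.name = N'YourAgDatabase';
```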

Any idea where else I should be looking for why this did not fail over?

Yes, see above. The check was changed to encompass more in SQL Server 2017.