SQL Server – Why Are Two HADR DMVs Reporting Different States?

availability-groupsdmvsql-server-2012

SQL Server 2012 (11.0.5058.0) Enterprise Edition

We have 8 Availability Groups in a 2(HA)+1(DR) cluster and our monitoring DMVs are reporting results that confuse me. 6 Availability Groups are configured for HA and DR, 1 is configured for HA only, and 1 is configured for DR only.

Each of the 6 HA/DR Availability Groups have "SQLB" as a primary and "SQLA" as a secondary (synchronous) HA replica and "SQLC" as a secondary (async) replica.

On both secondaries:

SELECT dhags.group_id, dhags.synchronization_health_desc
FROM sys.dm_hadr_availability_group_states dhags

reports that all Availability Group replication sync health are NOT_HEALTHY and

select replica_id,synchronization_health_desc
from sys.dm_hadr_availability_replica_states

reports that all replicas have a sync health of HEALTHY.

The primary replica reports all Availability Groups and replicas with a sync health of HEALTHY.

While I understand that one reports on replica sync health and the other reports on AG sync health, it seems logical to me that if the more granular (AG) state was not healthy, that would affect the overall health of the broader context (replica). I cannot find MSDN documentation that describes how the health is determined at each level.

Why would the secondaries report NOT_HEALTHY for Availability Group sync health, but HEALTHY for replica sync health, and why does this differ from the primary's report?

Best Answer

Sadly, sys.dm_hadr_availability_replica states is not a reliable indicator of replica health. Here's the Connect item on one of the bugs we've run into where that DMV stops refreshing - note in the comments that log_send_queue_size in the DMV sys.dm_hadr_database_replica_states shows 0 even when there's log data to be sent.

Note that the Connect item is marked as Won't Fix. Sad trombone.