Sql-server – SQL Server AlwaysOn: Lost of heartbeat and connection with secondary replica

Virtual Enviroment description:

Hypervisor: VMWare
WSFC composed by two nodes with Windows server 2016 Standard
SQL Server AlwaysOn with synchronous replica and automatic failover
SQL Server 2014 (SP3-CU2) 12.0.6214.1
Following the cluster thresholds:
CrossSubnetDelay : 4000
CrossSubnetThreshold : 40
PlumbAllCrossSubnetRoutes : 0
SameSubnetDelay : 2000
SameSubnetThreshold : 20
AG properties:
LeaseTimeout: 20000
FailureConditionLevel: 3
HealthCheckTimeout: 30000
VerboseLogging: 0

Issue:

RHS.exe process on primary node lost heartbeat with availability group and it initiated the failover. Immediately after the lost of heartbeat, primary node loses the connection with the secondary node and automatic failover fail. Shortly after, hearbeat with primary node works again and primary node takes over the resources again.

Log Details

In chronological order:

Primary node randomly shows signs of disconnection with the secondary replica already in the days leading up to failover:
...AlwaysOn Availability Groups connection with secondary database terminated for primary database 'nameofDB'..
The primary node storage randomly shows signs of distress already in the days leading up to failover and just before failover:
...Long Sync IO: ... IOs in nonpreemptive mode longer than 1000ms
FlushCache cleaned up 146070 bufs with 20273 writes in 80387ms...
SQL Server has encountered 4 occourence(s) of I/O request taking longer than 15 seconds to complete...
RHS.exe lost heartbeat with AG and WSFC send failover request to AG:
[hadrag] Failure detected, diagnostics heartbeat is lost
The local replica of availability group ... is preparing to transition to the resolving role..
Shortly after the failover request, the primary replica loses connection with the secondary:
...AlwaysOn Availability Groups connection with secondary database terminated for primary database 'nameofDB'..
The listener stops working and there is a checkpoint failure on a specific database:
One or more recovery units belonging to database .. failed to generate a checkpoint
Eleven seconds after the attempted failover, heartbeat works again and primary replica
takes over the resources

Questions

Million dollar questions:

why heartbeat was lost?
why connection with secondary replica was lost ?
Any workaround ? increase lease timeout, performance of storage or RAM and CPU can help?

I understand that it is difficult to answer but it can help to only have a strategy or hypotheses that can help me find the problem.

EDIT

I found an interesting example of extended event able to trace the duration of all synchronization steps on tigerteam github. The problem is that it can degrade performance and I haven't found a doc reference for analysing the values shows by extended events. Any idea or suggest for troubleshooting?

Best Answer

why heartbeat was lost? why connection with secondary replica was lost ?

Depending on which instance of DHS.exe it could be Network Problems or a failure in the local SQL Server instance.

Any workaround ? increase lease timeout, performance of storage or RAM and CPU can help?

Double-check your cluster quorum so that the primary node does not need to communicate with any far-off resources to maintain cluster quorum.

You also have a storage problem that needs to be addressed, but if you use iSCSI or other network-based storage, that might have the same root cause.

Virtual Enviroment description:

Issue:

Log Details

Questions

EDIT

Best Answer

Related Solutions

Sql-server – SQL Server AlwaysOn Error_number 35206

Sql-server – AlwaysOn Secondary Replica DB-size

Related Question