SQL Server Availability Group – LeaseTimeout and Slow IO Issues

availability-groupsclusteringsql-server-2016

Our 5 main databases runs on a physical (2 * 8 cores, 512GB, Hypertreading) SQL Server 2016 SP2 Enterprise in a single Availability Gruop and sometimes we get errors that the lease-timeout has expired. My understanding is that if the lease can't get updated there is a system-wide problem.

When I check the output of sp_server_diagnostics (*SQLDIAG*.xel files), in the log folder of the primary replica, around the time of the timeout I always find pending IO operations.

<ioSubsystem ioLatchTimeouts="0" intervalLongIos="0" totalLongIos="1">
<longestPendingRequests>
<pendingRequest duration="26566" filePath="\?\F:\SqlLogs\db1.ldf" offset="80824832" handle="0x8d10" />
<pendingRequest duration="1987" filePath="\?\O:\SqlLogs\db2.ldf" offset="3880740352" handle="0x1330" />
<pendingRequest duration="1093" filePath="\?\O:\SqlLogs\db3.ldf" offset="288143360" handle="0x132c" />
<pendingRequest duration="974" filePath="\?\O:\SqlLogs\db3.ldf" offset="288145408" handle="0x132c" />
<pendingRequest duration="937" filePath="\?\O:\SqlLogs\db3.ldf" offset="288146944" handle="0x132c" />
</longestPendingRequests>
</ioSubsystem>

This is what I find in the clusterlog of the primary replica:

WARN [RES] SQL Server Availability Group: [hadrag] Failed to retrieve data column. Return code -1
ERR [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost
ERR [RES] SQL Server Availability Group <AG_Name>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel
ERR [RES] SQL Server Availability Group <AG_Name>: [hadrag] Resource Alive result 0.
ERR [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost
ERR [RES] SQL Server Availability Group <AG_Name>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel
ERR [RES] SQL Server Availability Group <AG_Name>: [hadrag] Resource Alive result 0.
WARN [RHS] Resource AG_Name IsAlive has indicated failure.

This are the erros in the SQL Server errorlog:

Error: 19407, Severity: 16, State: 1
SQL Server hosting availability
group 'AG_Name' did not receive a process event signal from the
Windows Server Failover Cluster within the lease timeout period.

Error: 19407, Severity: 16, State: 1
The lease between availability
group 'AG_Name' and the Windows Server Failover Cluster has
expired. A connectivity issue occurred between the instance of SQL
Server and the Windows Server Failover Cluster. To determine whether
the availability group is failing over correctly, check the
corresponding availability group resource in the Windows Server
Failover Cluster.

Always On: The local replica of availability group 'AG_Name' is
going offline because either the lease expired or lease renewal
failed. This is an informational message only. No user action is
required.

This is the output from SELECT @@version:

Microsoft SQL Server 2016 (SP2-CU15) (KB4577775) – 13.0.5850.14 (X64)
Sep 17 2020 22:12:45 Copyright (c) Microsoft Corporation Enterprise
Edition: Core-based Licensing (64-bit) on Windows Server 2012 R2
Standard 6.3 (Build 9600: )

In our monitoring there are no signs of high cpu usage. Also no memory dumps are created at the time of the problem.

As a result of this timeout the WSFC-service restarts the cluster resource 'AG_Name'. After that this resource is restarted everything works perfect again.
What I don't understand is: how can slow IO-requests cause a lease-timeout? Can to many pending IO-requests cause a lease-timeout?

Best Answer

how can slow IO-requests cause a lease-timeout? Can to many pending IO-requests cause a lease-timeout?

No, slow I/O requests can't cause a lease timeout directly.

However, if the server is completely overloaded (CPU at 100%), than can cause pending I/O requests and lease timeouts. The default lease timeout is 20 seconds, and your pending I/O is 26 seconds. High CPU or some other server / OS level problem is more likely the issue here.

Another cause is that SQL Server encountered a serious error, and is generating dump files (which causes the process to pause, potentially long enough for WSFC to think the lease timed out).

See the documentation for a couple more possibilities:

OS not responding, low virtual memory, working set paging, generating dump, pegged CPU, WSFC down (loss of quorum)

You should review the SQL Server error log to see if there are dumps being created. If you have monitoring from the time of these incidents, you could also check for maxed out CPU.