We have one particular SQL Server which is intermittently timing out when accepting connections. The issue is consistent throughout the day, but occurs at a very low incidence. How can I continue to troubleshoot?
Connection Timeout Expired. The timeout period elapsed while
attempting to consume the pre-login handshake acknowledgement. This
could be because the pre-login handshake failed or the server was
unable to respond back in time. The duration spent while attempting
to connect to this server was – [Pre-Login] initialization=0;
handshake=15002; (Microsoft SQL Server, Error: -2)
Server Configuration:
- SQL Server 2016 SP1 CU5 Enterprise (issue also occurred prior to SP1)
- Windows Server 2012 R2 on both server and client
- VMware ESXi, 6.5.0 on HP ProLiant DL360 Gen9
- VM has 8 vCPU, 64 GiB of memory (fully reserved)
Test Script (executed once per second):
$failed = $false;
$loginDuration = (Measure-Command {
$ncon = New-Object System.Data.SqlClient.SqlConnection `
@( 'Data Source=1.2.3.4,16143;Database=Test;User=Test;Password=****;Pooling=false;' );
try
{
$ncon.Open();
$cmd = New-Object System.Data.SqlClient.SqlCommand `
@( 'SELECT @@VERSION', $ncon );
$cmd.ExecuteNonQuery();
$ncon.Dispose();
}
catch
{
$failed = $true;
}
}).TotalMilliseconds;
Write-Metric -metric 'itp.dbserver.logintime' -unit 'milliseconds' `
-value (&{if ($failed) { 120000 } else { $loginDuration }});
Observations:
- Issue started occurring after OS Updates, SQL Server Updates, San move, and move from Hyper-V to VMWare
- Most connections succeed (4 failures out of 1,440 attempts)
- Failures are always listed with a low number in "[Pre-Login] initialization=0;" and a high number in "handshake=15002". We do not get errors like "Not found" or "No such host is known", only "Connection Timeout"
- No encryption is enabled for the listener
- Pings show no loss over extended period (0 lost out of 96,045 sent)
- All firewalls are disabled
- Connections attempted using IPv6 and IPv4 addresses fail at the same rate
- CPU is low-ish (<40%)
- Active sessions is persistently around 400
- Balloon driver is disabled
- Connections once made are stable, no unexpected errors when executing queries, no odd disconnections.
- Multiple clients are having problems connecting – both ODBC and ADO from multiple computers
Update: I finally got a client-side Wireshark trace of a failed connection. No packet loss is apparent, client receives TCP ACK's in real-time (<10ms). Client was using DNS name at time of failure, but failure does occur using the IPv4 address in the connection string.
Am I correct in thinking that the fact that I get immediate TCP ACK's to the pre-login request packets sent would localize the issue to the OS or SQL Server?
Best Answer
This was eventually identified as a side-effect of VMWare LRO. Disabling host-based LRO resolved the issue. See