Sql-server – Connection Timeout Expired without apparent network issue

connectivitysql serversql-server-2016vmware

We have one particular SQL Server which is intermittently timing out when accepting connections. The issue is consistent throughout the day, but occurs at a very low incidence. How can I continue to troubleshoot?

Connection Timeout Expired.  The timeout period elapsed while
attempting to consume the pre-login handshake acknowledgement.  This
could be because the pre-login handshake failed or the server was
unable to respond back in time.  The duration spent while attempting
to connect to this server was – [Pre-Login] initialization=0;
handshake=15002;  (Microsoft SQL Server, Error: -2)

Server Configuration:

  • SQL Server 2016 SP1 CU5 Enterprise (issue also occurred prior to SP1)
  • Windows Server 2012 R2 on both server and client
  • VMware ESXi, 6.5.0 on HP ProLiant DL360 Gen9
  • VM has 8 vCPU, 64 GiB of memory (fully reserved)

Test Script (executed once per second):

$failed = $false;
$loginDuration = (Measure-Command {
    $ncon = New-Object System.Data.SqlClient.SqlConnection `
        @( 'Data Source=1.2.3.4,16143;Database=Test;User=Test;Password=****;Pooling=false;' );
    try 
    {
        $ncon.Open();

        $cmd = New-Object System.Data.SqlClient.SqlCommand `
            @( 'SELECT @@VERSION', $ncon );
        $cmd.ExecuteNonQuery();

        $ncon.Dispose();
    }
    catch
    {
        $failed = $true;
    }
}).TotalMilliseconds;
Write-Metric -metric 'itp.dbserver.logintime' -unit 'milliseconds' `
    -value (&{if ($failed) { 120000 } else { $loginDuration }});

Observations:

  • Issue started occurring after OS Updates, SQL Server Updates, San move, and move from Hyper-V to VMWare
  • Most connections succeed (4 failures out of 1,440 attempts)
  • Failures are always listed with a low number in "[Pre-Login] initialization=0;" and a high number in "handshake=15002". We do not get errors like "Not found" or "No such host is known", only "Connection Timeout"
  • No encryption is enabled for the listener
  • Pings show no loss over extended period (0 lost out of 96,045 sent)
  • All firewalls are disabled
  • Connections attempted using IPv6 and IPv4 addresses fail at the same rate
  • CPU is low-ish (<40%)
  • Active sessions is persistently around 400
  • Balloon driver is disabled
  • Connections once made are stable, no unexpected errors when executing queries, no odd disconnections.
  • Multiple clients are having problems connecting – both ODBC and ADO from multiple computers

Update: I finally got a client-side Wireshark trace of a failed connection. No packet loss is apparent, client receives TCP ACK's in real-time (<10ms). Client was using DNS name at time of failure, but failure does occur using the IPv4 address in the connection string.

Wireshark conversation graph showing server not responding for >15 seconds

Am I correct in thinking that the fact that I get immediate TCP ACK's to the pre-login request packets sent would localize the issue to the OS or SQL Server?