Sql-server – Known Good SQL Logins Failing (Error 18456) Accompanied by “Unable to Reuse a Session” (Error 18056)

sql serverwindows-server

I am losing my patience with this problem.

I did see a couple of other questions here that are similar, but I've already looked at the fix actions that resolved their issues and they did not apply to this scenario.

I'm running SQL Server 2008 R2 with all updates. When this particular issue occurs my servers lose all connection ability to the server and it has to be restarted.

It is very intermittent. Every now and again this SQL server instance will report the error: Login failed for user 'XX'. Reason: Failed to open the database configured in the login object while revalidating the login on the connection. Error: 18456, Severity: 14, State: 46.

I know this is a good login, it works all of the other times, and no one changes the credentials.

The above error is immediately followed by this error:

The client was unable to reuse a session with SPID 150, which had been reset for connection pooling. The failure ID is 46. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

Error: 18456, Severity: 14, State: 46.

When this occurs a few minutes later this is reported also:

Message All schedulers on Node 0 appear deadlocked due to a large number of worker threads waiting on ASYNC_NETWORK_IO. Process Utilization 0%.

The above message just keeps repeating until I restart the server.

From researching this issue I've increased the maximum worker threads (currently at 1100). The server has 72G memory and 4 vCPUs, running Windows Server 2008 R2.

I don't encounter this issue at all on my other database servers that run the same environment. Just this one. Any help would be greatly appreciated, I'm running out of options.

Best Answer

This is a guess.

It may be network errors. A client process will open a connection to the server, query and then typically close the connection. A few seconds/minutes later it does it again. Opening a connection is an expensive operation, so Connection Pooling was invented. With Connection Pooling, when the client process closes the connection, the connection is not actually closed: the object in the application representing the connection is invalidated and cannot be re-used, but the actual connection to the DB is preserved. When the process tries to open a new connection to the DB, the pooled connections associated with the process are searched for one with the same connection string; if found that connection is re-used avoiding the overhead of creating a new one. If one isn't found then a new one is created from scratch.

If during a query the network burps the actual underlying connection in the pool can be invalidated by the server, but the client end remains in the Pool. Re-using it will come up with an error tat mentions something about 'transport-layer level exceptions'.

The invalidation by the server would be logged by the server. I've never looked at what it logs having always come across these from the client end.

One way to recreate the error is to use SSMS to open a query against a server (could be localhost). Run a simple query and it should work. Stop and restart the SQLServer service and re-run the query. SSMS should give an error.

One way to deal with this is by clearing the Connection Pool. Another way is by stopping and restarting the client application (which effectively clears the pool because the app is now running in a different process).

Here is some code I used in an EF Azure-based app (where transient network errors should be catered for):

internal static class TransientError
{
    // Try a very light-weight SQL command to test the connection
    // Basically costs a round trip to the DB.
    // If an SqlException is returned, clear the connection pool, wait and retry.
    // If another SqlException double the wait until a total of 12 secs has been used.
    // After 12 secs (wait = 6400) return; it's not that transient so let standard error handlers
    // cope with it.
    // If any other exception occurs - don't try to do anything clever, just return and let nature
    // take its course.
    internal static void Try(DbContext context)
    {
        int wait = 800;
        bool retry = true;
        while (retry)
        {
            try
            {
                using (SqlConnection con = new SqlConnection(context.Database.Connection.ConnectionString))
                {
                    con.Open();
                    using (SqlCommand com = new SqlCommand("declare @i int;", con))
                    {
                        com.ExecuteNonQuery();
                    }
                }
                retry = false;
            }
            catch (SqlException)
            {
                Clear(context);
                Thread.Sleep(wait);
            }
            catch (Exception)
            {
                retry = false;
            }
            // break after we reach 6.4 secs (for a total as 12 secs)
            if (wait == 6400)
            {
                retry = false;
            }
            wait = wait * 2;
        }
    }
    private static void Clear(DbContext context)
    {
        // This is a clint side operation so won't cause an exception
        // with a server side error.
        // If it does fail, probably should throw!
        using (SqlConnection con = new SqlConnection(context.Database.Connection.ConnectionString))
        {
            SqlConnection.ClearPool(con);
        }
    }
}

This Try method was called before every EF access. For example:

public Service GetService(string serviceName)
{
    TransientError.Try(this.context);
    Service s = (from svc in this.context.Services
                 where svc.System == serviceName
                 select svc).FirstOrDefault<Service>();
    return s;
}

As i said, a guess. If you try the SSMS experiment above and the error matches exceptions thrown by the client then this is likely what is happening.