SQL Server 2012 AlwaysOn Cluster – Periodic Connection Failures

availability-groups, clustering, connection-pooling, connectivity, sql-server-2012

Since last Wednesday (so we've been dealing with this for a week now) we've been getting periodic connection problems on our production database cluster. (An AlwaysOn cluster running on Windows Server 2012, three machines, two synchronous replicas and one asynchronous.)

When it started we noticed the following two messages in the SQL Server event log, and they have been repeating ever since whenever the problem arises:

  • Error: 18456, Severity: 14, State: 46.
  • Login failed for user 'xxx' Reason: Failed to open the database 'xxx' configured in the login object while revalidating the login on the connection. [ CLIENT: xxx.xxx.xxx.xxx]

We've been trying just about everything and are reaching desperation at this point. The things we've tried:

  • Restart our firewalls
  • Restart our switches
  • Fail over and reboot the nodes (one by one)
  • Turn off the async node and see if anything changes
  • Turn off all but the primary node to see if anything changes
  • Replace the switch
  • Remove some of the network cables and run on a single NIC
  • Remove the NIC teaming from one server and fail over to that server

None of this seems to make any difference.

The errors our connecting code receives vary from a timeout when beginning a transaction to a timeout while executing a query (a query that does not show up in SQL Profiler when filtering for queries running longer than 5,000 ms).

The strangest error to me, however, is the following:

System.Data.SqlClient.SqlException: Connection Timeout Expired.  The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement.  This could be because the pre-login handshake failed or the server was unable to respond back in time.  The duration spent while attempting to connect to this server was - [Pre-Login] initialization=0; handshake=14991;

Now, I in no way consider myself a real DBA; hell, I manage the software development department and am actually looking to hire one at the moment. However, in my experience and with my limited knowledge, I find it very odd not to see any of these timing-out queries in Profiler.

As a matter of fact, I see no blocking queries or queries that raise my suspicion in any way…

Another thing to add here is that it seems to come in waves. Sometimes everything is fine for 15-25 minutes and then it's bad for ~2-5 minutes. This happens throughout the day as well, so not just during peak hours or when our normal processes run. (Basically I'm seeing errors from our APIs trying to reach the database; very few processes run at night and none of them run this periodically.)

The whole network is experiencing this as well, which lets me rule out code: we didn't publish any changes, nor do we all connect to the database in the same fashion (some through NHibernate, some ADO.NET, some old ADO).

The biggest recent changes happened on the network side, but our sysadmins seem adamant it can't be that, since none of the changes should affect this and they can't see anything suspicious on the switches.

Does anyone here have any idea what we could try / should look at next?

Best Answer

This is, at least in part, the answer we've found so far, and yes, it did take until last night to get anywhere at all with this problem.

We still don't really trust the network, and we have engineers looking into it since we also have other complaints. However, despite all the recent changes revolving entirely around infrastructure and the network, the only thing we've managed to fix and change so far was on the code end of things.

We have a piece of software that looks for files being dropped in a folder (a hot folder), and to make sure we never miss anything, every ~15 minutes it checks whether there are any files there that we're not yet processing. We have a lot of these watchers and they always run fine; nothing was changed in any of them either.

Here is the scenario that played out:

  • On the day we started having trouble, a file was placed in one of these folders and processed
  • The same day, the same file was placed there again

Now, the code itself was horrible; it ran for 2-20 minutes depending on the file, and it was quickly rewritten today to run in ~20 seconds. There were two massive mistakes in the code that caused this problem:

  • If none of the lines in the file (it was an XML file) were meant to be processed, the file would be left there, triggering a new run every ~15 minutes (so the next problem would happen over and over again indefinitely)
  • Every line would call a stored procedure and use a DataReader to see if it had any result; if it did, the code would move on to the next line WITHOUT closing the reader

The combination of these two things (and the fact that those files have hundreds if not thousands of rows) had our system crippled and on its knees begging for the end.
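The fix was essentially to make sure the reader and the connection get disposed on every line, including the lines that are skipped. This is only a rough sketch with placeholder names (ProcessFile, dbo.CheckLine, the connection string), not the actual rewritten code:

    using System.Data;
    using System.Data.SqlClient;
    using System.Xml.Linq;

    class HotFolderProcessor
    {
        // Rough sketch of the rewritten per-line check. All names here
        // (ProcessFile, dbo.CheckLine, the "key" attribute) are placeholders.
        public static void ProcessFile(string path, string connectionString)
        {
            foreach (var line in XDocument.Load(path).Root.Elements("line"))
            {
                using (var connection = new SqlConnection(connectionString))
                using (var command = new SqlCommand("dbo.CheckLine", connection))
                {
                    command.CommandType = CommandType.StoredProcedure;
                    command.Parameters.AddWithValue("@key", (string)line.Attribute("key"));
                    connection.Open();

                    using (var reader = command.ExecuteReader())
                    {
                        // Line already handled: skip it. Because of the using
                        // blocks, the reader is closed and the connection goes
                        // back to the pool even on this early exit.
                        if (reader.HasRows)
                            continue;
                    }

                    // ... actual processing of the line goes here ...
                }
            }
        }
    }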

Someone at the office blames 'connection pooling', and he might be right, he might not be. If I'm being honest, I'm not sure why this would hit our other systems so horribly (and, by the way, not the code causing it; we never got an exception or problem on this piece of software, which made it a pain to find).
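For what it's worth, the pooling theory at least has plausible mechanics, assuming the old code opened a new connection per line (I honestly don't know exactly what it did): a SqlConnection that is never closed or disposed while its reader is still open is never returned to the pool, so every line burns another pooled connection, and the server is left holding sessions with half-read result sets. A deliberately broken sketch of that pattern (placeholder connection string and table):

    using System.Data.SqlClient;

    class PoolLeakDemo
    {
        // Deliberately broken sketch of the leak pattern, with a placeholder
        // connection string and table. Neither the reader nor the connection
        // is ever disposed, so each iteration checks a connection out of the
        // pool and never gives it back.
        static void Main()
        {
            var connectionString =
                "Server=myServer;Database=myDb;Integrated Security=true;Max Pool Size=100";

            for (var i = 0; i < 200; i++)
            {
                var connection = new SqlConnection(connectionString);
                connection.Open();

                var command = new SqlCommand("SELECT * FROM dbo.SomeBigTable", connection);
                var reader = command.ExecuteReader();
                reader.Read();
                // The rest of the result set is never consumed; the server-side
                // session can end up waiting on the client (ASYNC_NETWORK_IO),
                // and once all 100 pooled connections are checked out, Open()
                // above starts timing out instead of connecting.
            }
        }
    }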

So if anyone can explain why this would break our SQL Server I'd be delighted; all I know is that the ASYNC_NETWORK_IO wait type might be the biggest clue, but I'm not enough of a DBA to find the culprit quickly. So again, if anyone has ideas/queries on how we could have pinpointed this, I would love to give them the solution vote.
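The closest thing I have to a starting point for next time is asking the server, during one of the bad waves, which client machines and applications are currently sitting on ASYNC_NETWORK_IO; the session DMVs expose the host and program name. A minimal sketch (the connection string is a placeholder), in case anyone can confirm whether this would actually have pinpointed it:

    using System;
    using System.Data.SqlClient;

    class AsyncNetworkIoCheck
    {
        // Minimal sketch: list, per client machine and application, the
        // requests currently waiting on ASYNC_NETWORK_IO (SQL Server waiting
        // for the client to consume results). The connection string is a
        // placeholder; the DMV query itself is standard.
        static void Main()
        {
            const string query = @"
                SELECT s.host_name,
                       s.program_name,
                       COUNT(*)         AS waiting_requests,
                       SUM(r.wait_time) AS total_wait_ms
                FROM sys.dm_exec_requests r
                JOIN sys.dm_exec_sessions s ON s.session_id = r.session_id
                WHERE r.wait_type = N'ASYNC_NETWORK_IO'
                GROUP BY s.host_name, s.program_name
                ORDER BY total_wait_ms DESC;";

            using (var connection = new SqlConnection(
                "Server=myServer;Database=master;Integrated Security=true"))
            using (var command = new SqlCommand(query, connection))
            {
                connection.Open();
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        Console.WriteLine("{0} / {1}: {2} requests, {3} ms waiting",
                            reader[0], reader[1], reader[2], reader[3]);
                    }
                }
            }
        }
    }

Run against the primary during one of the bad spells, I assume this would have shown the hot-folder machine and its program name near the top; if a real DBA can confirm or improve on that, the solution vote is theirs.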