SQL Azure: More Intermittent Timeouts

azure-sql-database

(also posted on MSDN forums http://social.msdn.microsoft.com/Forums/en-US/ssdsgetstarted/thread/79058c06-d3c5-4073-8571-c29d1f43f7ae)

Hi guys,

We have a set of 5 online auction systems running on Windows Azure & SQL Azure. Each system consists of a single worker role and one or more web roles, and is built with ASP.NET MVC 3 and Entity Framework, using the Repository pattern and StructureMap.

The worker role is responsible for housekeeping and runs two groups of processes, one every ten seconds and the other every second. Each process will typically run a database query or stored procedure. These are scheduled with Quartz.NET.
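
For reference, the scheduling is wired up roughly like the sketch below (the job class name is illustrative rather than our actual code); each job's Execute method ultimately opens a connection and runs a query or stored procedure against SQL Azure:

    using Quartz;
    using Quartz.Impl;

    // Illustrative housekeeping job - each run executes one query or sproc.
    public class HousekeepingJob : IJob
    {
        public void Execute(IJobExecutionContext context)
        {
            // Open a SqlConnection, run the housekeeping query/sproc, dispose.
        }
    }

    public static class HousekeepingScheduler
    {
        // Called once at worker role start-up.
        public static void Start()
        {
            IScheduler scheduler = StdSchedulerFactory.GetDefaultScheduler();
            scheduler.Start();

            IJobDetail job = JobBuilder.Create<HousekeepingJob>()
                .WithIdentity("housekeeping")
                .Build();

            // Fires every ten seconds; the one-second group is wired up the same way.
            ITrigger trigger = TriggerBuilder.Create()
                .StartNow()
                .WithSimpleSchedule(s => s.WithIntervalInSeconds(10).RepeatForever())
                .Build();

            scheduler.ScheduleJob(job, trigger);
        }
    }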

The web role serves the public interface and the back office. Among other basic CRUD functionality, both of these provide screens which, while open, repeatedly call controller methods that execute read-only stored procedures. Each client repeats its calls roughly every 2-3 seconds. A typical use case would be 5 back office windows and 25 end user windows open, all hitting the system repeatedly.

For a long time we have been experiencing intermittent SQL timeout errors. Three of the most common ones are:

System.Data.SqlClient.SqlException: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 – An existing connection was forcibly closed by the remote host.)

System.Data.SqlClient.SqlException: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 – The semaphore timeout period has expired.)

System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

The only predictable scenario is during an auction, where a specific controller -> sproc call starts to time out during the event (presumably due to load). At all other times the errors appear to be completely random and come in singles, twos, threes, etc., even during periods of user inactivity. For example, the system will go 18 hours without an error and then throw 5-10 errors from different housekeeping methods, or perhaps when a user logs on and views their account.

Other info:

I have tried running the affected queries/sprocs on SQL Azure using both local SSMS and the Azure web-based query tool; all seem to execute quickly, 1 second max. The query plans don't show anything too suspicious, although I am by no means a SQL query performance expert, or any other kind of expert for that matter :)

We have wrapped all affected areas in the SQL Azure Transient Fault Handling Block retry logic, but as discussed here http://social.msdn.microsoft.com/Forums/en-US/ssdsgetstarted/thread/7a50985d-92c2-472f-9464-a6591efec4b3, it does not catch timeouts, and according to “Valery M” this is for good reason.
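
The wrapping looks roughly like the sketch below (the exact namespace and strategy class name depend on which version of the Transient Fault Handling block you have, and the retry counts shown are just examples, not our real settings). As the linked thread explains, the default detection strategy does not classify command timeouts as transient, so those errors pass straight through the policy:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using Microsoft.Practices.EnterpriseLibrary.TransientFaultHandling;

    public static class AuctionQueries
    {
        // Example retry strategy: up to 5 attempts with an incremental back-off.
        private static readonly RetryPolicy Retry =
            new RetryPolicy<SqlDatabaseTransientErrorDetectionStrategy>(
                new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2)));

        public static void RunHousekeepingSproc(string connectionString)
        {
            Retry.ExecuteAction(() =>
            {
                // Connection and sproc name are placeholders for our actual calls.
                using (var connection = new SqlConnection(connectionString))
                using (var command = new SqlCommand("dbo.Housekeeping", connection))
                {
                    command.CommandType = CommandType.StoredProcedure;
                    connection.Open();
                    command.ExecuteNonQuery();
                }
            });
        }
    }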

We are not storing any session information in the database, although ASP.NET membership information is stored there.

We use one “SQL Azure server instance” which hosts all 5 databases: two for staging and three for production. All 5 systems are generally active at the same time, although it is unlikely that more than one will be under live load at any given time.
All web roles, worker roles and the SQL Azure server reside in the same Azure Geographical Region.

Any thoughts on where we should be looking? Would it help to give each system its own SQL Azure server? Failing a solution on our own, is it possible to get Microsoft to open a support ticket and take a look under the hood at what’s going on with our application, and how does one go about this?

Thanks in advance.

Ilan

Best Answer

Connections being forcibly closed like this are usually an indication of throttling. I have found that the main cause of throttling is a high number of outstanding requests within your database. Unfortunately there are no hard limits that can be monitored to identify when this throttling occurs. The good news is that the lower limits have been documented, which means you can track when you are getting close to the upper limits.

To give a plug to our Cotega monitoring service, this is one of the main things we track, so that people can be notified when they approach these upper limits. Another useful thing to do when this happens is to run deep database diagnostics to see what expensive queries, connections, etc. are occurring within your database, as described in “How to Perform SQL Azure Performance & Diagnostics Analysis for SQL Azure”.
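
If it helps, a crude way to watch this yourself is to poll sys.dm_exec_requests in the user database and log the count of outstanding requests. This is only a sketch (the connection string is a placeholder and the "interesting" threshold is something you would have to calibrate yourself), but it can tell you whether request counts spike around the times your connections get killed:

    using System;
    using System.Data.SqlClient;

    public static class RequestMonitor
    {
        // Placeholder connection string - point it at the user database, not master.
        private const string ConnectionString =
            "Server=tcp:yourserver.database.windows.net;Database=yourdb;" +
            "User ID=youruser@yourserver;Password=...;Encrypt=True;";

        private const string CountRequestsSql =
            "SELECT COUNT(*) FROM sys.dm_exec_requests " +
            "WHERE status IN ('running', 'runnable', 'suspended')";

        // Returns the number of requests currently outstanding in the database.
        public static int CountOutstandingRequests()
        {
            using (var connection = new SqlConnection(ConnectionString))
            using (var command = new SqlCommand(CountRequestsSql, connection))
            {
                connection.Open();
                int outstanding = (int)command.ExecuteScalar();
                Console.WriteLine("Outstanding requests: {0}", outstanding);
                return outstanding;
            }
        }
    }

Running something like this on a schedule (or from the worker role itself) and logging the result alongside the timeout errors should make it fairly obvious whether the forced disconnects line up with spikes in outstanding requests.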