Azure SQL Database – How to Handle CPU Limits Being Hit Too Quickly

azure-sql-database

Background in a nutshell: –
We have a SAAS solution with the following main components.
1. We have a web-portal back-end where administrators can edit data.
2. We have a web API that is called by mobile devices. The mobile devices track or report on students reading progress

Up to now the solution was hosted on virtual servers.
Now we are migrating the solution to the Azure framework so that we can take advantage of the scalability of elastic database pools.
We are using event topics to handle large volumes of posts from the mobile devices when the posts can be processed asynchronously,
but there are some posts that need to be processed synchronously, and we are finding Azure's fabric really slow when it comes to multiple concurrent connections.

An example of the issue: –
So when Azure runs a query like the following: –

SELECT q.Category, COUNT(*)
FROM Question q
JOIN Answer a
ON a.QuestionId = q.QuestionId
GROUP BY q.Category
ORDER BY q.Category

The SQL CPU peaks above 97% in all the following scenarios: –
1. The DTU's are 50 and there is more than one concurrent call.
2. The DTU's are 1500 and there are 5 or more concurrent calls.
3. The DTU's are 4000 and there are 20 or more concurrent calls.

So we opened a support call with Microsoft.
We spent more than a week of investigating things from sql statistics and indexes up to web api pricing tiers.
After all that we still came up with the evidence that the CPU was peaking in the SQL database with the scenarios as outlined above.

This leads to the inevitable "re-write large chunks of your system" kind of argument.

So the underlying issue is that elastic database pools can't seem to perform anywhere near the ability of standard SQL databases.
Also, the stand-alone database's performance doesn't seem to compete with the virtual server's performance.

This is so frustrating because Elastic databases pools were recommended for us for reasons of maintaining performance and adding scalability.
We currently run 700+ customers on one virtual server; and were expecting to create one shard database per customer.
The idea being that we could then scale up from hundreds of customers to tens of thousands of customers.
In reality we are fighting to get the Azure fabric to perform anywhere near the kind of performance we have on virtual servers.
So this question is to ask if there is anyone with significant experience in making Azure perform non-trivial tasks at a reasonable pace?
(preferably without having to re-write large chunks of the system)

Best Answer

For questions like these you need to include the query plan and relevant table structure (what keys and indexes are in place) to get a good answer that is specifically relevant. I recommend that you update the question with that detail.

The first thing that springs to mind is the JOIN to Answer - is there a useful index for the QuestionId column that you are joining on? A common error is to assume that defining a foreign key constraint also defines an index (much like defining a unique constraint does) but this is not the case. Most RDBMSs don't because there are occasions where the index is not needed at all so creating one implicitly would be wasteful of space and unnecessarily add I/O to insert & update operations.

If there is no useful index on Answer.QuestionId then that table will be subject to a scan (and probably a sort too) which if the table is small enough to fit in memory will generally be a CPU-bound operation.

If you are seeing this a lot in your DB with the same pattern of query then it may be that your DB devs are working under the assumption that adding a foreign key also defines an index, possibly because they have a mySQL background as IIRC that does create an index implicitly when an FK is defined.