Azure SQL Database – DTU Compute Utilization Maxing Out

azure, azure-sql-database

I have an Azure SQL database whose DTU baseline has gone from a low 0–3% average to now maxing out at 100% utilization.

Azure SQL DTU consumption

This is for a LOB application and we need to understand what is causing the utilization to max out like this. The server is on the Standard tier with 200 DTUs and is not in an elastic pool.

I've looked at Query Performance Insight in the Azure portal, and while there are some long-running queries, the top offender represents only 0.3% CPU and 0.23% data IO over the past week.

Query insights

Can anyone offer a good strategy for diagnosing the issue here?

Best Answer

the top offender represents only 0.3% CPU

A DTU is a package of CPU and IO time, so you most likely have queries performing a lot of read IO (index or table scanning most likely) or write IO.
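As a first check, you can ask the database itself which dimension of the DTU is saturated. This is a sketch against `sys.dm_db_resource_stats`, which holds roughly the last hour of 15-second samples for the database:

```sql
-- Run against the user database itself (not master).
-- Compare the columns: whichever one is pinned near 100 is the
-- dimension driving the overall DTU percentage.
SELECT TOP (20)
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent
FROM   sys.dm_db_resource_stats
ORDER  BY end_time DESC;
```

If `avg_data_io_percent` or `avg_log_write_percent` is the saturated column while CPU stays low, that confirms an IO-bound workload.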

The sudden change could be because of:

  • A new release of the application that has new SQL activity that hasn't been tested on data scaled/shaped like your production data. Look for long-running statements that consume a lot of IO. The solution here will be either in improving those queries and/or the indexing choices to support them.
  • Significant change in your data size (sudden or gradual), so that your common working set no longer fits comfortably in RAM and the buffer pool thrashes, causing extra IO work.
  • Significant change in your data shape (sudden or gradual), so that the query planner has a harder time choosing an optimal path. Refreshing index statistics can sometimes help with this if it has not been done for a long time.
  • If your application allows users to define reports or filters they may have created something that has the same effect as point 1 (you'll probably detect this in the same way).
  • A sudden influx of active users, or existing users suddenly becoming more active. Check your application logs for that.

Or some mix of the above.
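To hunt for the heavy readers from the first point, a sketch using `sys.dm_exec_query_stats` sorted by logical reads rather than CPU (figures cover the time since each plan entered the cache):

```sql
-- Top statements by cumulative logical reads; the SUBSTRING pulls the
-- individual statement out of its batch using the statement offsets.
SELECT TOP (10)
       qs.total_logical_reads,
       qs.execution_count,
       qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
       SUBSTRING(st.text,
                 qs.statement_start_offset / 2 + 1,
                 (CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                  END - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM   sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER  BY qs.total_logical_reads DESC;
```

For the stale-statistics possibility, `EXEC sp_updatestats;` refreshes statistics across the database as a blunt-but-simple option.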

You do not state what service level your database is currently at (service tier, performance level, whether it is in a pool and, if so, the pool's tier/level and per-database settings) - it would be a good idea to edit that into your question.

For a short-term fix, bump up the DTU allocation for that database. If the load is mainly IO, consider moving from the Standard to the Premium tier: while Microsoft doesn't publicly document exactly how a DTU is composed/judged, a Premium DTU is said to have an order of magnitude more IO than a Standard one, so going from Standard-100 to Premium-125 can sometimes be far more effective than going from Standard-100 to Standard-400, and cost less.
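The scale-up can be done from the portal, or with a one-liner; a sketch in T-SQL (run while connected to the `master` database; the database name and the `'P1'` service objective are placeholders for your own choices):

```sql
-- Scale the database to Premium P1 (125 DTU).
-- Substitute your database name and target objective, e.g. 'S6', 'P2'.
ALTER DATABASE [YourDatabase] MODIFY (SERVICE_OBJECTIVE = 'P1');
```

The change is online, though existing connections may be dropped when the switchover completes.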

Update after Extra Information in Question

the top offender ... data IO of 0.23%

Your screen-grab shows the display sorted by CPU - so the top offender for CPU use is apparently not using much IO. But if you are looking for processes that are causing a lot of IO, you need to sort by IO instead, as a heavy IO consumer may not be consuming much CPU (your CPU allocation could be sitting idle waiting for the IO to complete).

Though from that display it would be worth checking that the behaviour of those four is as good as it could be, particularly the fourth, which is averaging 22.5 minutes per execution. That could be due to locking issues, hence not showing much CPU or IO activity, but in any case it is worth investigating.
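To test the locking theory while that slow statement is running, a sketch using `sys.dm_exec_requests`: a non-zero `blocking_session_id` together with an `LCK_*` wait type points at lock contention rather than CPU or IO pressure.

```sql
-- Sessions currently blocked by another session, with what they
-- are waiting on and the SQL text they are running.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,     -- milliseconds in the current wait
       r.status,
       t.text AS running_sql
FROM   sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE  r.blocking_session_id <> 0;
```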

over the past week.

Your problem, according to the top graph, only started a couple of days ago, so looking at the full week is not likely to be very meaningful. The Performance Insight view allows you to zoom in more finely than that.