SQL Server – Most efficient way to retrieve a sub-query COUNT grouped by top table

azure-sql-database, performance, query-performance, sql-server

Given the following schema

CREATE TABLE categories
(
    id UNIQUEIDENTIFIER PRIMARY KEY,
    name NVARCHAR(50)
);

CREATE TABLE [group]
(
    id UNIQUEIDENTIFIER PRIMARY KEY
);

CREATE TABLE logger
(
    id UNIQUEIDENTIFIER PRIMARY KEY,
    group_id UNIQUEIDENTIFIER,
    uuid CHAR(17)
);

CREATE TABLE data
(
    id UNIQUEIDENTIFIER PRIMARY KEY,
    logger_uuid CHAR(17),
    category_name NVARCHAR(50),
    recorded_on DATETIME
);

And the following rules

  1. Each data record references a logger and a category
  2. Each logger will always have a group
  3. Each group can have multiple loggers
  4. I only want to count the most recent data recorded

category_name isn't unique per row; it's just a way to associate a given data record with a category. id is really just a surrogate key.

What would be the optimal way to achieve a result set like

category_id | logger_group_count
------------|-------------------
12345       | 4
67890       | 2
.....       | ...

i.e. count the number of groups for each category in which a logger has recorded data?

As an initial stab I came up with:

SELECT g.id, COUNT(DISTINCT a.id) AS logger_group_count
FROM categories g
  LEFT OUTER JOIN data d ON d.category_name = g.name
  INNER JOIN logger s ON s.uuid = d.logger_uuid
  INNER JOIN [group] a ON a.id = s.group_id
GROUP BY g.id

But it's extremely slow (~45s); data has 400k+ records. Here's the query plan and here's a fiddle to play with.

I want to make sure I'm eking the most out of the query before I start looking at other things, i.e. hardware utilization: Azure SQL costs can go up substantially, even though you may just need a little more juice out of your current tier.

Best Answer

You're on a newer version of SQL Server so the actual plan gives you a lot of information. See the caution sign on the SELECT operator? That means that SQL Server generated a warning which could affect query performance. You should always look at those:

<Warnings>
  <PlanAffectingConvert ConvertIssue="Seek Plan" Expression="[s].[logger_uuid]=CONVERT_IMPLICIT(nchar(17),[d].[uuid],0)" />
  <PlanAffectingConvert ConvertIssue="Seek Plan" Expression="CONVERT_IMPLICIT(nvarchar(100),[d].[name],0)=[g].[name]" />
</Warnings>

There are two data type conversions caused by your schema. Based on the warnings I suspect that name is actually an NVARCHAR(100) and logger_uuid is an NCHAR(17). The posted table schema in the question may not be correct. You should understand the root cause of why these conversions are happening and fix it. Some types of data type conversions prevent index seeks, lead to cardinality estimate issues, and cause other problems.
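The mechanical fix is to give both sides of each join the same type. This is a sketch only – the warnings suggest the deployed types differ from the posted schema, so verify the real column definitions before running anything like this; the target types below are assumptions taken from the warnings:

-- Align the join columns so no CONVERT_IMPLICIT is needed. The types
-- here are assumptions; the point is that both sides of a join must match.
ALTER TABLE data ALTER COLUMN logger_uuid NCHAR(17);        -- match logger.uuid if it's really NCHAR(17)
ALTER TABLE data ALTER COLUMN category_name NVARCHAR(100);  -- match categories.name if it's really NVARCHAR(100)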

Another important thing to check is wait stats. You can see those in the details of the SELECT operator as well. Here's the XML for your wait stats and the time spent by the query:

<WaitStats>
  <Wait WaitType="RESOURCE_GOVERNOR_IDLE" WaitTimeMs="49515" WaitCount="3773" />
  <Wait WaitType="SOS_SCHEDULER_YIELD" WaitTimeMs="57164" WaitCount="2466" />
</WaitStats>
<QueryTimeStats ElapsedTime="67135" CpuTime="10007" />

I'm not a cloud guy, but it looks like your query isn't able to fully engage a CPU. That's probably related to your current Azure tier. The query only needed about 10 seconds of CPU when executing but it took 67 seconds. I believe that 50 seconds of that time was spent being throttled and 7 seconds of that time was given to you but used on other queries that were running concurrently. The bad news is that the query is slower than it could be due to your tier. The good news is that any reduction in CPU could lead to a 5X reduction in run time. In other words, if you can get the query to use 1 second of CPU then you might see a runtime of around 5 seconds.
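As an aside, if you want to see accumulated waits beyond a single plan, Azure SQL Database exposes a database-scoped wait view. A minimal sketch (sys.dm_db_wait_stats accumulates since the last reset, so it covers more than just this query):

-- Top waits for the database, heaviest first.
SELECT TOP (10)
    wait_type,
    wait_time_ms,
    waiting_tasks_count
FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms DESC;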

Next you can look at the Actual Time Statistics property in your operator details to see where the CPU time was spent. Your plan uses row mode, so the CPU time for an operator is the sum of time spent by that operator as well as its children. This is a relatively simple plan, so it doesn't take long to discover that the clustered index scan on logger_data uses 6527 ms of CPU time. The loop join that calls it uses 10006 ms of CPU time, so all of your query's CPU is spent at that step. Another clue that something is going wrong at that step can be found by looking at the relative thickness of the arrows:

[screenshot: thick arrows in the query plan]

A lot of rows are returned from that operator, so it's worth looking at in detail. Looking at the actual number of rows for the clustered index scan you can see that 14088885 rows were returned and 14100798 rows were read. However, the table cardinality is just 484803 rows. Intuitively that seems pretty inefficient, right? The clustered index scan returns far more than the number of rows in the table. Some other plan with a different join type or access method on the table is likely to be more efficient.

Why did SQL Server read and return so many rows? The clustered index is on the inner side of a nested loop. There are 38 rows returned by the outer side of the loop (the scan on the logger table), so the scan on logger_data executes 38 times. 484803 * 38 = 18422514, which is pretty close to the number of rows read. So why did SQL Server choose a plan that feels so inefficient? It even estimated that it would do 57 scans of the table, so arguably the plan you got was more efficient than it expected.

You might have been wondering why there's a TOP operator in your plan. SQL Server introduced a row goal when creating the plan for your query. This might be more detail than you want, but the short version is that SQL Server does not always need to return all rows from a clustered index scan. Sometimes it can stop early if it only needs a fixed number of rows and it finds those rows before it reaches the end of the scan. A scan isn't as expensive if it can stop early, so the operator cost is discounted by a formula when a row goal is present. In other words, SQL Server expects to scan the clustered index 57 times, but it thinks that it will find the single row that it needs very quickly. It only needs a single row from each scan due to the presence of the TOP operator.
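As a toy illustration of the mechanism (this is not your plan, and the literal value is made up): the query below can stop scanning as soon as it finds one match, which is exactly the discount a row goal models.

-- With TOP (1) the scan can end early: once the first matching row
-- surfaces, the rest of the table never has to be read.
SELECT TOP (1) id
FROM data
WHERE logger_uuid = 'LOGGER-0000000001';  -- hypothetical 17-character uuid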

You can make your query faster by encouraging the query optimizer to pick a plan that doesn't scan the logger_data table 38 times. This might be as simple as eliminating the data type conversions, which could allow SQL Server to do an index seek instead of a scan. If not, fix the conversions and create a covering index on logger_data:

CREATE INDEX IX ON logger_data (category_name, logger_uuid);

The query optimizer chooses a plan based on cost. Adding this index makes the slow plan that scans logger_data many times unlikely, because it'll be cheaper to access the table through an index seek than through a clustered index scan.

If you aren't able to add the index, you can consider a query hint that disables the introduction of row goals: OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL')). You should only do this if you feel comfortable with the concept of row goals and understand them. Adding that hint should result in a different plan, but I can't say how efficient it'll be.
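For completeness, here's your original query with that hint bolted on – a sketch, since as noted I can't promise the resulting plan will be any better:

SELECT g.id, COUNT(DISTINCT a.id) AS logger_group_count
FROM categories g
  LEFT OUTER JOIN data d ON d.category_name = g.name
  INNER JOIN logger s ON s.uuid = d.logger_uuid
  INNER JOIN [group] a ON a.id = s.group_id
GROUP BY g.id
OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL'));  -- ask the optimizer to plan without row goals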