SQL Server – Most efficient way to retrieve a sub-query COUNT grouped by top table

azure-sql-database, performance, query-performance, sql-server

Given the following schema

CREATE TABLE categories
(
    id UNIQUEIDENTIFIER PRIMARY KEY,
    name NVARCHAR(50)
);

CREATE TABLE [group]
(
    id UNIQUEIDENTIFIER PRIMARY KEY
);

CREATE TABLE logger
(
    id UNIQUEIDENTIFIER PRIMARY KEY,
    group_id UNIQUEIDENTIFIER,
    uuid CHAR(17)
);

CREATE TABLE data
(
    id UNIQUEIDENTIFIER PRIMARY KEY,
    logger_uuid CHAR(17),
    category_name NVARCHAR(50),
    recorded_on DATETIME
);

And the following rules

  1. Each data record references a logger and a category
  2. Each logger will always have a group
  3. Each group can have multiple loggers
  4. I only want to count the most recent data recorded

category_name isn't unique per row; it's just a way to associate a given data record with a category. id is really just a surrogate key.

What would be the optimal way to achieve a result set like

category_id | logger_group_count
------------|-------------------
12345       | 4
67890       | 2
.....       | ...

i.e. count the number of groups for each category in which a logger has recorded data?

As an initial stab I came up with:

SELECT g.id, COUNT(DISTINCT a.id) AS logger_group_count
FROM categories g
  LEFT OUTER JOIN data d ON d.category_name = g.name
  INNER JOIN logger s ON s.uuid = d.logger_uuid
  INNER JOIN [group] a ON a.id = s.group_id
GROUP BY g.id

But it's extremely slow (~45s); data has 400k+ records. Here's the query plan and here's a fiddle to play with.

I want to make sure I'm eking the most out of the query before I start looking at other things, i.e. hardware utilization: Azure SQL costs can go up substantially, even though you may just need a little more juice out of your current tier.

Best Answer

You're on a newer version of SQL Server so the actual plan gives you a lot of information. See the caution sign on the SELECT operator? That means that SQL Server generated a warning which could affect query performance. You should always look at those:

<Warnings>
  <PlanAffectingConvert ConvertIssue="Seek Plan" Expression="[s].[logger_uuid]=CONVERT_IMPLICIT(nchar(17),[d].[uuid],0)" />
  <PlanAffectingConvert ConvertIssue="Seek Plan" Expression="CONVERT_IMPLICIT(nvarchar(100),[d].[name],0)=[g].[name]" />
</Warnings>

There are two data type conversions caused by your schema. Based on the warnings I suspect that name is actually an NVARCHAR(100) and logger_uuid is an NCHAR(17). The posted table schema in the question may not be correct. You should understand the root cause of why these conversions are happening and fix it. Some types of data type conversions prevent index seeks, lead to cardinality estimate issues, and cause other problems.
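The mechanical fix is to give both sides of each join the same type. This is a sketch only – the warnings suggest the deployed types differ from the posted schema, so verify the real column definitions before running anything like this; the target types below are assumptions taken from the warnings:

-- Align the join columns so no CONVERT_IMPLICIT is needed. The types
-- here are assumptions; the point is that both sides of a join must match.
ALTER TABLE data ALTER COLUMN logger_uuid NCHAR(17);        -- match logger.uuid if it's really NCHAR(17)
ALTER TABLE data ALTER COLUMN category_name NVARCHAR(100);  -- match categories.name if it's really NVARCHAR(100)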

Another important thing to check is wait stats. You can see those in the details of the SELECT operator as well. Here's the XML for your wait stats and the time spent by the query:

<WaitStats>
  <Wait WaitType="RESOURCE_GOVERNOR_IDLE" WaitTimeMs="49515" WaitCount="3773" />
  <Wait WaitType="SOS_SCHEDULER_YIELD" WaitTimeMs="57164" WaitCount="2466" />
</WaitStats>
<QueryTimeStats ElapsedTime="67135" CpuTime="10007" />

I'm not a cloud guy, but it looks like your query isn't able to fully engage a CPU. That's probably related to your current Azure tier. The query only needed about 10 seconds of CPU when executing but it took 67 seconds. I believe that 50 seconds of that time was spent being throttled and 7 seconds of that time was given to you but used on other queries that were running concurrently. The bad news is that the query is slower than it could be due to your tier. The good news is that any reduction in CPU could lead to a 5X reduction in run time. In other words, if you can get the query to use 1 second of CPU then you might see a runtime of around 5 seconds.
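As an aside, if you want to see accumulated waits beyond a single plan, Azure SQL Database exposes a database-scoped wait view. A minimal sketch (sys.dm_db_wait_stats accumulates since the last reset, so it covers more than just this query):

-- Top waits for the database, heaviest first.
SELECT TOP (10)
    wait_type,
    wait_time_ms,
    waiting_tasks_count
FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms DESC;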

Next you can look at the Actual Time Statistics property in your operator details to see where the CPU time was spent. Your plan uses row mode, so the CPU time for an operator is the sum of time spent by that operator as well as its children. This is a relatively simple plan, so it doesn't take long to discover that the clustered index scan on logger_data uses 6527 ms of CPU time. The loop join that calls it uses 10006 ms of CPU time, so all of your query's CPU is spent at that step. Another clue that something is going wrong at that step can be found by looking at the relative thickness of the arrows:

[screenshot: thick arrows in the query plan]

A lot of rows are returned from that operator, so it's worth looking at in detail. Looking at the actual number of rows for the clustered index scan you can see that 14088885 rows were returned and 14100798 rows were read. However, the table cardinality is just 484803 rows. Intuitively that seems pretty inefficient, right? The clustered index scan returns far more than the number of rows in the table. Some other plan with a different join type or access method on the table is likely to be more efficient.

Why did SQL Server read and return so many rows? The clustered index is on the inner side of a nested loop. There are 38 rows returned by the outer side of the loop (the scan on the logger table), so the scan on logger_data executes 38 times. 484803 * 38 = 18422514, which is pretty close to the number of rows read. So why did SQL Server choose a plan that feels so inefficient? It even estimated that it would do 57 scans of the table, so arguably the plan you got was more efficient than it expected.

You might have been wondering why there's a TOP operator in your plan. SQL Server introduced a row goal when creating the plan for your query. This might be more detail than you want, but the short version is that SQL Server does not always need to return all rows from a clustered index scan. Sometimes it can stop early if it only needs a fixed number of rows and it finds those rows before it reaches the end of the scan. A scan isn't as expensive if it can stop early, so the operator cost is discounted by a formula when a row goal is present. In other words, SQL Server expects to scan the clustered index 57 times, but it thinks that it will find the single row that it needs very quickly. It only needs a single row from each scan due to the presence of the TOP operator.
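As a toy illustration of the mechanism (this is not your plan, and the literal value is made up): the query below can stop scanning as soon as it finds one match, which is exactly the discount a row goal models.

-- With TOP (1) the scan can end early: once the first matching row
-- surfaces, the rest of the table never has to be read.
SELECT TOP (1) id
FROM data
WHERE logger_uuid = 'LOGGER-0000000001';  -- hypothetical 17-character uuid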

You can make your query faster by encouraging the query optimizer to pick a plan that doesn't scan the logger_data table 38 times. This might be as simple as eliminating the data type conversions, which could allow SQL Server to do an index seek instead of a scan. If not, fix the conversions and create a covering index on logger_data:

CREATE INDEX IX ON logger_data (category_name, logger_uuid);

The query optimizer chooses a plan based on cost. Adding this index makes the slow plan that scans logger_data many times unlikely, because it'll be cheaper to access the table through an index seek than through a clustered index scan.

If you aren't able to add the index, you can consider a query hint that disables the introduction of row goals: OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL')). You should only do this if you feel comfortable with the concept of row goals and understand them. Adding that hint should result in a different plan, but I can't say how efficient it'll be.
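For completeness, here's your original query with that hint bolted on – a sketch, since as noted I can't promise the resulting plan will be any better:

SELECT g.id, COUNT(DISTINCT a.id) AS logger_group_count
FROM categories g
  LEFT OUTER JOIN data d ON d.category_name = g.name
  INNER JOIN logger s ON s.uuid = d.logger_uuid
  INNER JOIN [group] a ON a.id = s.group_id
GROUP BY g.id
OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL'));  -- ask the optimizer to plan without row goals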