Sql-server – Replace long GROUP BY list with a subquery

join;sql serversubquery

This is a repost of my question on Stack Overflow. They suggested to ask it here:

I found an online article from 2005, where the author claims, that many devs use GROUP BY wrong, and that you should better replace it with a subquery.

I've tested it on one of my queries, where I need to sort the result of a search by the number of joined entries from another table (more common ones should appear first). My original, classic approach was to join both tables on a common ID, group by each field in the select list and order the result by the count of the sub table.

Now, Jeff Smith from the linked blog claims, that you should better use a subselect, which does all the grouping, and than join to that subselect. Checking the execution plans of both approaches, SSMS states, that the large group by requires 52% of the time and the subselect one 48%, so from a technical standpoint, it seems, that the subselect approach is actually marginally faster. However, the "improved" SQL command seems to generate a more complicated execution plan (in terms of nodes)

What do you think? Can you give me some detail about how to interpret the execution plans in this specific case and which one is generally the preferable option?

SELECT
    a.ID,
    a.ID_AddressType,
    a.Name1,
    a.Name2,
    a.Street,
    a.Number,
    a.ZipCode,
    a.City,
    a.Country
FROM dbo.[Address] a
INNER JOIN CONTAINSTABLE(
    dbo.[Address],
    FullAddress,
    '"ZIE*"',
    5
) s ON a.ID = s.[KEY]
LEFT JOIN dbo.Haul h ON h.ID_DestinationAddress = a.ID
GROUP BY
    a.ID,
    a.ID_AddressType,
    a.Name1,
    a.Name2,
    a.Street,
    a.Number,
    a.ZipCode,
    a.City,
    a.Country,
    s.RANK
ORDER BY s.RANK DESC, COUNT(*) DESC;

SELECT
    a.ID,
    a.ID_AddressType,
    a.Name1,
    a.Name2,
    a.Street,
    a.Number,
    a.ZipCode,
    a.City,
    a.Country
FROM dbo.[Address] a
INNER JOIN CONTAINSTABLE(
    dbo.[Address],
    FullAddress,
    '"ZIE*"',
    5
) s ON a.ID = s.[KEY]
LEFT JOIN (
    SELECT ID_DestinationAddress, COUNT(*) Cnt
    FROM dbo.Haul
    GROUP BY ID_DestinationAddress
) h ON h.ID_DestinationAddress = a.ID
ORDER BY s.RANK DESC, h.Cnt DESC;

enter image description here

Best Answer

If you change the left join with dbo.Haul to a subquery, it will calculate these distinct values of ID_DestinationAddress (Stream Aggregate) and Count them (Compute scalar) directly after getting the data from the scan.

This is what you are seeing in the execution plan:

enter image description here

While, when using the GROUP BY method it is only doing the grouping after data passed through the left join between dbo.Haul and dbo.[Address].

enter image description here

How much better it is will depend on the unique value ratio of dbo.Haul. Less unique values means a better outcome for the second execution plan, since the left join has to process less values.

The other positive result of the second query is that only the uniqueness of ID_DestinationAddress is calculated, not the uniqueness of all the columns as a whole in the group by.

Again, you should test & validate the results for your query, dataset & indexes. One of the ways to test if you are not familiar with execution plans is setting SET STATISTICS IO, TIME ON; before executing the queries and making these runtime stats more readable by pasting them in a tool such as statisticsparser.


Testing

A small test to show what differences in data can do for these queries.

If the dbo.Haul table does not have many matches with the 5 records returned by the FULLTEXT index filtering, the difference is not so big:

Group by query plan

Subquery Query plan

1000 rows could get filtered earlier, but the execution time is around 15ms for both queries anyway on my machine.

Now, if I change my data so these 5 records have many more matches with dbo.Haul on the left join:

The difference between the group by query enter image description here

    <QueryTimeStats CpuTime="1564" ElapsedTime="1566" />

And the Subquery becomes more clear

enter image description here

and the stats:

<QueryTimeStats CpuTime="680" ElapsedTime="690"/>