SQL Server Optimization – Why SQL Server Runs Subquery for Each Row

database-tuning, optimization, sql-server-2008-r2, subquery

This query runs in ~21 seconds (execution plan):

select 
    a.month
    , count(*) 
from SubqueryTest a 
where a.year = (select max(b.year) from SubqueryTest b)
group by a.month

When the subquery is replaced with a variable, it runs in <1 second (execution plan):

declare @year float
select @year = max(b.year) from SubqueryTest b
select 
    month
    , count(*) 
from SubqueryTest where year = @year group by month

Judging from the execution plan, the "select max…" sub-select is run for each of the millions of rows in "SubqueryTest a", which is why it takes so long.

My question: Since the sub-select is scalar, deterministic and not correlated, why doesn't the query optimizer do what I did in my second example and run the subquery once, store the result, then use it for the main query? I'm sure there's just a hole in my understanding of SQL Server, but I'd really like help filling it – a couple of hours with Google haven't helped.

The table is just over 1gb with almost 28 million records:

CREATE TABLE SubqueryTest(
  [pk_id] [int] IDENTITY(1,1) NOT NULL
  , [Year] [float] NULL
  , [Month] [float] NULL
  , PRIMARY KEY CLUSTERED ([pk_id] ASC))

CREATE NONCLUSTERED INDEX idxSubqueryTest ON SubqueryTest ([Year] ASC)

Best Answer

The slow plan isn't calculating the MAX for each row in the outer query.

In fact it never explicitly calculates it at all.

It gives a plan similar to the one this query would produce:

WITH CTE
     AS (SELECT TOP(1) WITH TIES *
         FROM   SubqueryTest
         WHERE year IS NOT NULL
         ORDER  BY year desc)
SELECT month,
       count(*)
FROM   CTE
GROUP  BY month 

Slow Plan (Estimated Row Counts)


You have a non-covering index on year ascending, so it scans that index backwards to get the rows in the maximum year (this shows as a seek because of the implicit IS NOT NULL predicate).

Unfortunately it doesn't seem to differentiate between TOP 1 and TOP 1 WITH TIES when estimating row counts.

In this case that makes a huge difference (an estimated 2 key lookups vs. an actual 4,424,803), so you get an inappropriate plan.
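The semantic gap the estimator glosses over is easy to see side by side. A minimal sketch on the table from the question (both queries order by year descending, but only the second pulls back the whole maximum-year group):

```sql
-- TOP (1): at most one row, regardless of duplicate [Year] values
SELECT TOP (1) *
FROM   SubqueryTest
WHERE  [Year] IS NOT NULL
ORDER  BY [Year] DESC;

-- TOP (1) WITH TIES: the first row PLUS every row tied with it on the
-- ORDER BY column -- here, every row in the maximum year, which can be
-- millions of rows even though the estimate stays near 1
SELECT TOP (1) WITH TIES *
FROM   SubqueryTest
WHERE  [Year] IS NOT NULL
ORDER  BY [Year] DESC;
```

The optimizer's internal rewrite of the original query behaves like the second form, while the row-count estimate looks like the first.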

Slow Plan (Actual Row Counts)


You could consider adding month to the index on year, either as a key column or as an included column, to make the index covering. The benefit of adding it as a secondary key column is that it could then feed into a stream aggregate without an additional sort (though with only 12 distinct values a hash aggregate would be fine anyway).
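As a sketch (the index names are my own), the two options look like this:

```sql
-- Month as a secondary key column: rows come back ordered by
-- (Year, Month), so the GROUP BY can use a stream aggregate
-- with no additional sort
CREATE NONCLUSTERED INDEX idxSubqueryTest_YearMonth
    ON SubqueryTest ([Year] ASC, [Month] ASC);

-- Month as an included column: the index still covers the query
-- (no key lookups), but rows within a year are not ordered by
-- Month, so expect a hash aggregate -- fine for 12 distinct values
CREATE NONCLUSTERED INDEX idxSubqueryTest_YearInclMonth
    ON SubqueryTest ([Year] ASC)
    INCLUDE ([Month]);
```

Either one removes the key lookups that make the slow plan so expensive.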

A non-covering index on such a non-selective column is really pretty useless for the vast majority of queries. The index is ignored entirely by the "fast" plan, which ends up doing a parallel scan of the whole table and evaluating the predicate against all 27,445,400 rows (in preference to performing the huge number of lookups).

Fast Plan