SQL Server – Low cardinality estimate in hash match operator

Tags: cardinality-estimates, performance, query-performance, sql-server

I am trying to fix a performance problem with one of our reporting queries on SQL Server 2008 R2.

I have included the part of the query that is causing the low estimate. This part is then joined to other tables. Because the estimate here is so low, the subsequent joins end up as nested loops, causing the query to run forever.

select n.Transactionid
from nath n
where n.StatusId = 3
  and n.Date is not null
  and not exists (select 1
                  from nath h
                  where h.Transactionid = n.Transactionid
                    and h.StatusId = 3
                    and h.HistoryId < n.HistoryId);

Estimated plan

(screenshot: estimated execution plan)

The estimate for the hash match is only 1.17, but in reality 550K rows come out of it. Statistics have been updated with a full scan.

I ran the exact same query on one of our SQL Server 2014 instances and it produced much better results: the estimate was 557K on the hash match operator. I then tried trace flag 9481 to force the legacy cardinality estimator on 2014, and the estimate dropped back to 1. So I think the issue has something to do with how the old CE estimates self-joins.

I tried trace flag 4199 on SQL Server 2008 R2, but that did not help.

Actual execution plan

I didn't want the actual table names to be visible, so I created similar tables with fewer columns and different table and column names. The estimates are slightly different from those mentioned above, but the bigger problem still persists.

SQL Server 2014 with TF 9481

(I don't have a SQL Server 2008 R2 test environment):

(screenshot: execution plan with the low estimate)

SQL Server 2014

(screenshot: execution plan with the accurate estimate)

Please let me know if there is any way to fix this incorrect estimate.

Repro

The issue can be simulated with the below script:

    -- Build ~50K rows: 9,999 distinct c2 values, 5 rows each
    create table nat (c1 int identity(1,1) primary key, c2 int);

    declare @a int = 1;
    declare @b int = 1;
    while @a < 10000
    begin
        set @b = 1;
        while @b <= 5
        begin
            insert into nat (c2) select @a;
            set @b = @b + 1;
        end;
        set @a = @a + 1;
    end;

select * from nat a
where not exists (select 1 from nat b
                  where b.c2 = a.c2 and b.c1 < a.c1)
option (querytraceon 9481); -- estimated number of rows from hash match: 1

select * from nat a
where not exists (select 1 from nat b
                  where b.c2 = a.c2 and b.c1 < a.c1);
-- estimated number of rows from hash match: 49995

I have done some testing with the above query on SQL Server 2012 and I am not able to force the new cardinality estimator behaviour with trace flag 4199.
Current testing results:

  • SQL 2014 – High estimates on Hash match operator
  • SQL 2014 with TF 9481 – Low(1) estimate
  • SQL 2012 – Low(1) estimate
  • SQL 2012 with TF 4199 – Still low estimates

How is it that I am able to replicate old cardinality behavior on 2014, but not able to get new CE estimates on 2012?

Is it that the change is not part of trace flag 4199 and only came about in 2014?

Changing the NOT EXISTS to a LEFT JOIN seems to produce a better estimate.
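For reference, a LEFT JOIN rewrite of the NOT EXISTS would look something like this (a sketch against the repro table; the exact query I used is not shown above). Rows for which no earlier c1 exists within the same c2 group survive the NULL test:

```sql
-- Anti-join expressed as LEFT JOIN ... IS NULL (sketch, repro table)
select a.*
from nat as a
left join nat as b
    on  b.c2 = a.c2
    and b.c1 < a.c1
where b.c1 is null;
```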

Best Answer

The question of why one cardinality estimation model produces closer results than the other in this case is actually not that interesting. The original CE estimates that not finding a matching row has a very small probability; the new CE calculates that it is almost certain. Both are 'correct', just based on different modelling assumptions. Fundamentally, multi-column semi joins are tricky to evaluate based on single-column statistical information.

It is much more interesting to think about what the query is trying to do, and how we can write it in a way that is more compatible with the statistical information available to SQL Server.

A key observation is that the query will return row(s) with one value per group. In the case of the original query, that is row(s) with the minimum HistoryId value for each Transactionid. In the repro, it is row(s) with the minimum c1 value for each different value of c2. The NOT EXISTS query is just one way of expressing that requirement.

SQL Server has good statistical information about distinct values (density) so all we need to do is write the query in a way that makes it clear we want one value per group. There are many ways to do this, for example (using your repro):

SELECT * 
FROM dbo.nat AS N
WHERE N.c1 =
(
    SELECT MIN(N2.c1) 
    FROM dbo.nat AS N2
    WHERE N2.c2 = N.c2
);

or, equivalently:

SELECT N.* 
FROM dbo.nat AS N
JOIN
(
    SELECT 
        N.c2,
        MIN(N.c1) AS c1
    FROM dbo.nat AS N
    GROUP BY 
        N.c2
) AS J
    ON J.c2 = N.c2
    AND J.c1 = N.c1;

This produces an exactly correct estimate of 9999 rows in 2008 R2, 2012, and 2014 (both CE models):

(screenshot: execution plan showing the 9999-row estimate)
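Another common way to express "one row per group" is ROW_NUMBER — not shown in the plans above, so its estimates would need checking on each version. Because c1 is the primary key there are no ties, so this returns the same rows as the MIN-based query:

```sql
-- Alternative formulation using ROW_NUMBER (estimates not verified here)
with ranked as
(
    select
        N.c1,
        N.c2,
        row_number() over (partition by N.c2 order by N.c1) as rn
    from dbo.nat as N
)
select c1, c2
from ranked
where rn = 1;
```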

With a natural index (which would probably be unique as well):

CREATE INDEX i ON dbo.nat (c2, c1);

The plan is even simpler:

(screenshot: simpler execution plan using the index)

You may not always be able to get this very simple plan form, depending on indexes, and other factors. The point I am making is that using basic grouping and joining operations often gets better results from the optimizer (and its cardinality estimation component) than more complex alternatives.

Final notes to clear some misconceptions in the question: the 'new CE' was introduced in 2014. TF 4199 enables plan-affecting optimizer fixes. TF 9481 specifies the original ('legacy') CE, and is only effective on 2014 and later versions.
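As a practical check (an addition, not part of the original answer): on SQL Server 2014 and later, the showplan XML records which model compiled the plan in the CardinalityEstimationModelVersion attribute (70 for the legacy CE, 120 for the new CE on 2014). One way to inspect it is to capture the actual plan for the query in question:

```sql
-- Return the actual plan XML alongside the results, then inspect
-- the CardinalityEstimationModelVersion attribute in that XML
set statistics xml on;

select * from dbo.nat as a
where not exists (select 1 from dbo.nat as b
                  where b.c2 = a.c2 and b.c1 < a.c1);

set statistics xml off;
```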