SQL Server – Low cardinality estimate in hash match operator

Tags: cardinality-estimates, performance, query-performance, sql-server

I am trying to fix a performance problem with one of our reporting queries on SQL Server 2008 R2.

I have included the part of the query that is causing the low estimate. This part is then joined to other tables. Because the estimate here is so low, the subsequent joins end up as nested loops, causing the query to run forever.

select n.Transactionid
from nath n
where n.StatusId = 3
  and n.Date is not null
  and not exists (select 1
                  from nath h
                  where h.Transactionid = n.Transactionid
                    and h.StatusId = 3
                    and h.HistoryId < n.HistoryId);

Estimated plan

(screenshot: estimated execution plan)

The estimate for the hash match is only 1.17, but in reality 550K rows come out of it. Statistics have been updated with a full scan.

I ran the exact same query on one of our SQL Server 2014 instances and it produced much better results: the estimate was 557K on the hash match operator. I then tried trace flag 9481 to force the legacy cardinality estimator on 2014, and the estimate dropped back to 1. So I think the issue has something to do with how the old CE estimates self-joins.

I tried trace flag 4199 on SQL Server 2008 R2, but that did not help.

Actual execution plan

I didn't want the actual table names to be visible, so I created similar tables with fewer columns and different table and column names. The estimates are slightly different from those mentioned above, but the bigger problem still persists.

SQL Server 2014 with TF 9481

(I don't have a SQL Server 2008 R2 test environment):

(screenshot: execution plan with the low estimate)

SQL Server 2014

(screenshot: execution plan with the accurate estimate)

Please let me know if there is any way to fix this incorrect estimate.

Repro

The issue can be simulated with the below script:

    -- Build ~50K rows: 9,999 distinct c2 values, 5 rows each
    create table nat (c1 int identity(1,1) primary key, c2 int);

    declare @a int = 1;
    declare @b int = 1;
    while @a < 10000
    begin
        set @b = 1;
        while @b <= 5
        begin
            insert into nat (c2) select @a;
            set @b = @b + 1;
        end;
        set @a = @a + 1;
    end;

select * from nat a
where not exists (select 1 from nat b
                  where b.c2 = a.c2 and b.c1 < a.c1)
option (querytraceon 9481); -- estimated number of rows from hash match: 1

select * from nat a
where not exists (select 1 from nat b
                  where b.c2 = a.c2 and b.c1 < a.c1);
-- estimated number of rows from hash match: 49995

I have done some testing with the above query on SQL Server 2012 and I am not able to force the new cardinality estimator behaviour with trace flag 4199.
Current testing results:

  • SQL 2014 – High estimates on Hash match operator
  • SQL 2014 with TF 9481 – Low(1) estimate
  • SQL 2012 – Low(1) estimate
  • SQL 2012 with TF 4199 – Still low estimates

How is it that I am able to replicate old cardinality behavior on 2014, but not able to get new CE estimates on 2012?

Is it that the change is not part of trace flag 4199 and only came about in 2014?

Changing the NOT EXISTS to a LEFT JOIN seems to produce a better estimate.
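For reference, a LEFT JOIN rewrite of the NOT EXISTS would look something like this (a sketch against the repro table; the exact query I used is not shown above). Rows for which no earlier c1 exists within the same c2 group survive the NULL test:

```sql
-- Anti-join expressed as LEFT JOIN ... IS NULL (sketch, repro table)
select a.*
from nat as a
left join nat as b
    on  b.c2 = a.c2
    and b.c1 < a.c1
where b.c1 is null;
```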

Best Answer

The question of why one cardinality estimation model produces closer results than the other in this case is actually not that interesting. The original CE estimates that not finding a matching row has a very small probability; the new CE calculates that it is almost certain. Both are 'correct', just based on different modelling assumptions. Fundamentally, multi-column semi joins are tricky to evaluate based on single-column statistical information.

It is much more interesting to think about what the query is trying to do, and how we can write it in a way that is more compatible with the statistical information available to SQL Server.

A key observation is that the query will return row(s) with one value per group. In the case of the original query, that is row(s) with the minimum HistoryId value for each Transactionid. In the repro, it is row(s) with the minimum c1 value for each different value of c2. The NOT EXISTS query is just one way of expressing that requirement.

SQL Server has good statistical information about distinct values (density) so all we need to do is write the query in a way that makes it clear we want one value per group. There are many ways to do this, for example (using your repro):

SELECT * 
FROM dbo.nat AS N
WHERE N.c1 =
(
    SELECT MIN(N2.c1) 
    FROM dbo.nat AS N2
    WHERE N2.c2 = N.c2
);

or, equivalently:

SELECT N.* 
FROM dbo.nat AS N
JOIN
(
    SELECT 
        N.c2,
        MIN(N.c1) AS c1
    FROM dbo.nat AS N
    GROUP BY 
        N.c2
) AS J
    ON J.c2 = N.c2
    AND J.c1 = N.c1;

This produces an exactly correct estimate of 9999 rows in 2008 R2, 2012, and 2014 (both CE models):

(screenshot: execution plan showing the 9999-row estimate)
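Another common way to express "one row per group" is ROW_NUMBER — not shown in the plans above, so its estimates would need checking on each version. Because c1 is the primary key there are no ties, so this returns the same rows as the MIN-based query:

```sql
-- Alternative formulation using ROW_NUMBER (estimates not verified here)
with ranked as
(
    select
        N.c1,
        N.c2,
        row_number() over (partition by N.c2 order by N.c1) as rn
    from dbo.nat as N
)
select c1, c2
from ranked
where rn = 1;
```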

With a natural index (which would probably be unique as well):

CREATE INDEX i ON dbo.nat (c2, c1);

The plan is even simpler:

(screenshot: simpler execution plan using the index)

You may not always be able to get this very simple plan form, depending on indexes, and other factors. The point I am making is that using basic grouping and joining operations often gets better results from the optimizer (and its cardinality estimation component) than more complex alternatives.

Final notes to clear some misconceptions in the question: the 'new CE' was introduced in 2014. TF 4199 enables plan-affecting optimizer fixes. TF 9481 specifies the original ('legacy') CE, and is only effective on 2014 and later versions.
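As a practical check (an addition, not part of the original answer): on SQL Server 2014 and later, the showplan XML records which model compiled the plan in the CardinalityEstimationModelVersion attribute (70 for the legacy CE, 120 for the new CE on 2014). One way to inspect it is to capture the actual plan for the query in question:

```sql
-- Return the actual plan XML alongside the results, then inspect
-- the CardinalityEstimationModelVersion attribute in that XML
set statistics xml on;

select * from dbo.nat as a
where not exists (select 1 from dbo.nat as b
                  where b.c2 = a.c2 and b.c1 < a.c1);

set statistics xml off;
```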