Sql-server – Query 100x slower in SQL Server 2014, Row Count Spool row estimate the culprit

cardinality-estimatesperformancequery-performancesql serversql server 2014

I have a query that runs in 800 milliseconds in SQL Server 2012 and takes about 170 seconds in SQL Server 2014. I think that I've narrowed this down to a poor cardinality estimate for the Row Count Spool operator. I've read a bit about spool operators (e.g., here and here), but am still having trouble understanding a few things:

  • Why does this query need a Row Count Spool operator? I don't think it's necessary for correctness, so what specific optimization is it trying to provide?
  • Why does SQL Server estimate that the join to the Row Count Spool operator removes all rows?
  • Is this a bug in SQL Server 2014? If so, I'll file in Connect. But I'd like a deeper understanding first.

Note: I can re-write the query as a LEFT JOIN or add indexes to the tables in order to achieve acceptable performance in both SQL Server 2012 and SQL Server 2014. So this question is more about understanding this specific query and plan in depth and less about how to phrase the query differently.

The slow query

See this Pastebin for a full test script. Here is the specific test query I'm looking at:

-- Prune any existing customers from the set of potential new customers
-- This query is much slower than expected in SQL Server 2014 
SELECT *
FROM #potentialNewCustomers -- 10K rows
WHERE cust_nbr NOT IN (
    SELECT cust_nbr
    FROM #existingCustomers -- 1MM rows
)

SQL Server 2014: The estimated query plan

SQL Server believes that the Left Anti Semi Join to the Row Count Spool will filter the 10,000 rows down to 1 row. For this reason, it selects a LOOP JOIN for the subsequent join to #existingCustomers.

enter image description here

SQL Server 2014: The actual query plan

As expected (by everyone but SQL Server!), the Row Count Spool did not remove any rows. So we are looping 10,000 times when SQL Server expected to loop just once.

enter image description here

SQL Server 2012: The estimated query plan

When using SQL Server 2012 (or OPTION (QUERYTRACEON 9481) in SQL Server 2014), the Row Count Spool does not reduce the estimated # of rows and a hash join is chosen, resulting in a far better plan.

enter image description here

The LEFT JOIN re-write

For reference, here is a way that I may re-write the query in order to achieve good performance in all SQL Server 2012, 2014, and 2016. However, I'm still interested in the specific behavior of the query above and whether it is a bug in the new SQL Server 2014 Cardinality Estimator.

-- Re-writing with LEFT JOIN yields much better performance in 2012/2014/2016
SELECT n.*
FROM #potentialNewCustomers n
LEFT JOIN (SELECT 1 AS test, cust_nbr FROM #existingCustomers) c
    ON c.cust_nbr = n.cust_nbr
WHERE c.test IS NULL

enter image description here

Best Answer

Why does this query need a Row Count Spool operator? ... what specific optimization is it trying to provide?

The cust_nbr column in #existingCustomers is nullable. If it actually contains any nulls the correct response here is to return zero rows (NOT IN (NULL,...) will always yield an empty result set.).

So the query can be thought of as

SELECT p.*
FROM   #potentialNewCustomers p
WHERE  NOT EXISTS (SELECT *
                   FROM   #existingCustomers e1
                   WHERE  p.cust_nbr = e1.cust_nbr)
       AND NOT EXISTS (SELECT *
                       FROM   #existingCustomers e2
                       WHERE  e2.cust_nbr IS NULL) 

With the rowcount spool there to avoid having to evaluate the

EXISTS (SELECT *
        FROM   #existingCustomers e2
        WHERE  e2.cust_nbr IS NULL) 

More than once.

This just seems to be a case where a small difference in assumptions can make quite a catastrophic difference in performance.

After updating a single row as below...

UPDATE #existingCustomers
SET    cust_nbr = NULL
WHERE  cust_nbr = 1;

... the query completed in less than a second. The row counts in actual and estimated versions of the plan are now nearly spot on.

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT *
FROM   #potentialNewCustomers
WHERE  cust_nbr NOT IN (SELECT cust_nbr
                        FROM   #existingCustomers 
                       ) 

enter image description here

Zero rows are output as described above.

The Statistics Histograms and auto update thresholds in SQL Server are not granular enough to detect this kind of single row change. Arguably if the column is nullable it might be reasonable to work on the basis that it contains at least one NULL even if the statistics histogram doesn't currently indicate that there are any.