SQL Server – Hash join between master/detail tables produces too-low cardinality estimate

cardinality-estimates, execution-plan, sql-server, sql-server-2014

When joining a master table to a detail table, how can I encourage SQL Server 2014 to use the cardinality estimate of the larger (detail) table as the cardinality estimate of the join output?

For example, when joining 10K master rows to 100K detail rows, I want SQL Server to estimate the join at 100K rows, the same as the estimated number of detail rows. How should I structure my queries and/or tables and/or indexes to help SQL Server's estimator leverage the fact that every detail row always has a corresponding master row? (Meaning that a join between them should never reduce the cardinality estimate.)

Here are more details. Our database has a master/detail pair of tables: VisitTarget has one row for each sales transaction, and VisitSale has one row for each product in each transaction. It's a one-to-many relationship: one VisitTarget row corresponds to an average of 10 VisitSale rows.

The tables look like this: (I'm simplifying to only the relevant columns for this question)

-- "master" table
CREATE TABLE VisitTarget
(
  VisitTargetId int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
  SaleDate date NOT NULL,
  StoreId int NOT NULL
  -- other columns omitted for clarity  
);
-- covering index for date-scoped queries
CREATE NONCLUSTERED INDEX IX_VisitTarget_SaleDate 
    ON VisitTarget (SaleDate) INCLUDE (StoreId /*, ...more columns */);

-- "detail" table
CREATE TABLE VisitSale
(
  VisitSaleId int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
  VisitTargetId int NOT NULL,
  SaleDate date NOT NULL, -- denormalized; copied from VisitTarget
  StoreId int NOT NULL, -- denormalized; copied from VisitTarget
  ItemId int NOT NULL,
  SaleQty int NOT NULL,
  SalePrice decimal(9,2) NOT NULL
  -- other columns omitted for clarity  
);
-- covering index for date-scoped queries
CREATE NONCLUSTERED INDEX IX_VisitSale_SaleDate 
  ON VisitSale (SaleDate)
  INCLUDE (VisitTargetId, StoreId, ItemId, SaleQty, SalePrice /*, ...more columns */
);
ALTER TABLE VisitSale 
  WITH CHECK ADD CONSTRAINT FK_VisitSale_VisitTargetId 
  FOREIGN KEY (VisitTargetId)
  REFERENCES VisitTarget (VisitTargetId);
ALTER TABLE VisitSale
  CHECK CONSTRAINT FK_VisitSale_VisitTargetId;

For performance reasons, we've partially denormalized by copying the most common filtering columns (e.g. SaleDate) from the master table into each detail row, and then we added covering indexes on both tables to better support date-filtered queries. This works great to reduce I/O when running date-filtered queries, but I think this approach is causing cardinality estimation problems when joining the master and detail tables together.
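For reference, a quick way to verify that invariant (this check query is illustrative, not part of our schema or workload) is something like the following, which should report zero mismatches:

-- Illustrative sanity check: every detail row should have a master row,
-- and the denormalized SaleDate/StoreId copies should match it exactly.
SELECT COUNT(*) AS MismatchedRows
FROM VisitSale vs
    LEFT JOIN VisitTarget vt on vt.VisitTargetId = vs.VisitTargetId
WHERE vt.VisitTargetId IS NULL       -- orphaned detail row (blocked by the FK anyway)
   OR vt.SaleDate <> vs.SaleDate     -- denormalized date out of sync
   OR vt.StoreId <> vs.StoreId;      -- denormalized store out of sync
-- Expected result: 0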

When we join these two tables, queries look like this:

SELECT vt.StoreId, vt.SomeOtherColumn, Sales = sum(vs.SalePrice*vs.SaleQty)
FROM VisitTarget vt 
    JOIN VisitSale vs on vt.VisitTargetId = vs.VisitTargetId
WHERE
    vs.SaleDate BETWEEN '20170101' and '20171231'
    and vt.SaleDate BETWEEN '20170101' and '20171231'
    -- more filtering goes here, e.g. by store, by product, etc. 
GROUP BY vt.StoreId, vt.SomeOtherColumn

The date filter on the detail table (VisitSale) is logically redundant. It's there to enable sequential I/O (i.e. an Index Seek) against the detail table's covering index for queries that are filtered by a date range.
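For comparison, one simple experiment (not shown above) is to run the same aggregate without the redundant detail-side predicate and see whether the join estimate improves; if it does, the duplicated SaleDate filter is at least contributing to the problem:

-- Experiment: same query, but only the master-side date filter.
-- Compare the join operator's estimated vs. actual rows against the original plan.
SELECT vt.StoreId, vt.SomeOtherColumn, Sales = sum(vs.SalePrice*vs.SaleQty)
FROM VisitTarget vt 
    JOIN VisitSale vs on vt.VisitTargetId = vs.VisitTargetId
WHERE vt.SaleDate BETWEEN '20170101' and '20171231'
GROUP BY vt.StoreId, vt.SomeOtherColumn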

The plan for these kinds of queries looks like this:

[Screenshot of the actual execution plan, showing the hash join between the two tables]

An actual plan of a query with the same problem can be found here.

As you can see, the cardinality estimation for the join (the tooltip in the lower-left in the picture) is over 4x too low: 2.1M actual vs. 0.5M estimated. This causes performance issues (e.g. spilling to tempdb), especially when this query is a subquery that's used in a more complex query.

But the row-count estimates for each branch of the join are close to the actual row counts. The top half of the join is 100K actual vs. 164K estimated. The bottom half of the join is 2.1M rows actual vs. 3.7M estimated. Hash bucket distribution also looks good. These observations suggest to me that statistics are OK for each table, and that the problem is the estimation of the join cardinality.
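For the record, the single-table statistics can be double-checked directly; a minimal sketch, assuming the statistics of interest are the ones created with the covering indexes defined above:

-- Inspect the SaleDate histograms that drive the single-table estimates.
DBCC SHOW_STATISTICS ('VisitTarget', 'IX_VisitTarget_SaleDate') WITH HISTOGRAM;
DBCC SHOW_STATISTICS ('VisitSale', 'IX_VisitSale_SaleDate') WITH HISTOGRAM;

-- If they look stale, refresh them and re-check the plan.
UPDATE STATISTICS VisitTarget IX_VisitTarget_SaleDate WITH FULLSCAN;
UPDATE STATISTICS VisitSale IX_VisitSale_SaleDate WITH FULLSCAN;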

At first I thought the problem was that SQL Server expects the SaleDate columns in the two tables to be independent, whereas really they are identical. So I tried adding an equality comparison on the sale dates to the join condition or the WHERE clause, e.g.

ON vt.VisitTargetId = vs.VisitTargetId and vt.SaleDate = vs.SaleDate

or

WHERE vt.SaleDate = vs.SaleDate

This didn't work. It even made the cardinality estimates worse! So either SQL Server isn't exploiting that equality predicate, or something else is the root cause of the problem.

Got any ideas for how to troubleshoot and hopefully fix this cardinality estimation issue? My goal is for the cardinality of the master/detail join to be estimated the same as the estimate for the larger ("detail table") input of the join.

If it matters, we're running SQL Server 2014 Enterprise SP2 CU8 build 12.0.5557.0 on Windows Server. There are no trace flags enabled. Database compatibility level is SQL Server 2014. We see the same behavior on multiple different SQL Servers, so it seems unlikely to be a server-specific problem.
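One low-risk experiment (not something we've tried yet) is to compare the estimate under the legacy cardinality estimator for just this statement; on SQL Server 2014 that can be requested per query with a trace flag (which requires sysadmin rights, or a plan guide):

-- Compile this one statement under the legacy (pre-2014) CE model and
-- compare the join estimate against the new-CE plan.
SELECT vt.StoreId, vt.SomeOtherColumn, Sales = sum(vs.SalePrice*vs.SaleQty)
FROM VisitTarget vt 
    JOIN VisitSale vs on vt.VisitTargetId = vs.VisitTargetId
WHERE
    vs.SaleDate BETWEEN '20170101' and '20171231'
    and vt.SaleDate BETWEEN '20170101' and '20171231'
GROUP BY vt.StoreId, vt.SomeOtherColumn
OPTION (QUERYTRACEON 9481); -- 9481 = legacy CE for this query only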

There's an optimization in the SQL Server 2014 Cardinality Estimator that provides exactly the behavior I'm looking for:

The new CE, however, uses a simpler algorithm that assumes that there is a one-to-many join association between a large table and a small table. This assumes that each row in the large table matches exactly one row in the small table. This algorithm returns the estimated size of the larger input as the join cardinality.

Ideally I could get this behavior, where the cardinality estimate for the join would be the same as the estimate for the large table, even though my "small" table will still return over 100K rows!
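It's also worth confirming which CE model the plan is actually compiled under. The showplan root exposes a CardinalityEstimationModelVersion attribute (120 = new CE, 70 = legacy); it's visible in the graphical plan's properties, or it can be pulled from the plan cache with something like the query below (the LIKE filter is just an illustrative way to find the statement):

-- Which CE model was used for cached plans touching VisitSale?
SELECT qp.query_plan.value(
         'declare namespace p="http://schemas.microsoft.com/sqlserver/2004/07/showplan";
          (//p:StmtSimple/@CardinalityEstimationModelVersion)[1]', 'varchar(10)') AS ce_model,
       st.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
WHERE st.text LIKE '%VisitSale%';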

Best Answer

Assuming that no improvement can be gained by doing something to statistics or by using the legacy CE, the most straightforward way around your problem is to change your INNER JOIN to a LEFT OUTER JOIN:

SELECT vt.StoreId, vt.SomeOtherColumn, Sales = sum(vs.SalePrice*vs.SaleQty)
FROM VisitSale vs
    LEFT OUTER JOIN VisitTarget vt on vt.VisitTargetId = vs.VisitTargetId
            AND vt.SaleDate BETWEEN '20170101' and '20171231'
WHERE vs.SaleDate BETWEEN '20170101' and '20171231'
GROUP BY vt.StoreId, vt.SomeOtherColumn

If you have a foreign key between the tables, you always filter on the same SaleDate range for both tables, and SaleDate always matches between the tables, then the results of your query should not change. It may seem odd to use an outer join like this, but think of it as informing the query optimizer that the join to the VisitTarget table will never reduce the number of rows returned by the query. Unfortunately, foreign keys do not change cardinality estimates, so sometimes you need to resort to tricks like this. (Microsoft Connect suggestion: Make optimizer estimations more accurate by using metadata).

It's possible that writing the query in this form won't work well, depending on what else happens in the query after the join. You could also try using a temp table to hold the intermediate result set whose cardinality estimate matters most.
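A rough sketch of that temp-table variant (the #FilteredSale name and column list are illustrative): materializing the date-filtered detail rows first gives the optimizer an exact row count, plus fresh statistics, for the join that follows.

-- Materialize the date-filtered detail rows; their cardinality is then
-- known exactly when the join query below is compiled.
SELECT vs.VisitTargetId, vs.StoreId, vs.ItemId, vs.SaleQty, vs.SalePrice
INTO #FilteredSale
FROM VisitSale vs
WHERE vs.SaleDate BETWEEN '20170101' and '20171231';

SELECT vt.StoreId, vt.SomeOtherColumn, Sales = sum(fs.SalePrice*fs.SaleQty)
FROM #FilteredSale fs
    LEFT OUTER JOIN VisitTarget vt on vt.VisitTargetId = fs.VisitTargetId
GROUP BY vt.StoreId, vt.SomeOtherColumn;

DROP TABLE #FilteredSale;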