SQL Server – Hash join between master/detail tables produces too-low cardinality estimate

cardinality-estimates, execution-plan, sql-server, sql-server-2014

When joining a master table to a detail table, how can I encourage SQL Server 2014 to use the cardinality estimate of the larger (detail) table as the cardinality estimate of the join output?

For example, when joining 10K master rows to 100K detail rows, I want SQL Server to estimate the join at 100K rows, the same as the estimated number of detail rows. How should I structure my queries and/or tables and/or indexes to help SQL Server's estimator leverage the fact that every detail row always has a corresponding master row? (Meaning that a join between them should never reduce the cardinality estimate.)

Here are more details. Our database has a master/detail pair of tables: VisitTarget has one row for each sales transaction, and VisitSale has one row for each product in each transaction. It's a one-to-many relationship: one VisitTarget row corresponds to an average of 10 VisitSale rows.

The tables look like this: (I'm simplifying to only the relevant columns for this question)

-- "master" table
CREATE TABLE VisitTarget
(
  VisitTargetId int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
  SaleDate date NOT NULL,
  StoreId int NOT NULL
  -- other columns omitted for clarity  
);
-- covering index for date-scoped queries
CREATE NONCLUSTERED INDEX IX_VisitTarget_SaleDate 
    ON VisitTarget (SaleDate) INCLUDE (StoreId /*, ...more columns */);

-- "detail" table
CREATE TABLE VisitSale
(
  VisitSaleId int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
  VisitTargetId int NOT NULL,
  SaleDate date NOT NULL, -- denormalized; copied from VisitTarget
  StoreId int NOT NULL, -- denormalized; copied from VisitTarget
  ItemId int NOT NULL,
  SaleQty int NOT NULL,
  SalePrice decimal(9,2) NOT NULL
  -- other columns omitted for clarity  
);
-- covering index for date-scoped queries
CREATE NONCLUSTERED INDEX IX_VisitSale_SaleDate 
  ON VisitSale (SaleDate)
  INCLUDE (VisitTargetId, StoreId, ItemId, SaleQty, SalePrice /*, ...more columns */
);
ALTER TABLE VisitSale 
  WITH CHECK ADD CONSTRAINT FK_VisitSale_VisitTargetId 
  FOREIGN KEY (VisitTargetId)
  REFERENCES VisitTarget (VisitTargetId);
ALTER TABLE VisitSale
  CHECK CONSTRAINT FK_VisitSale_VisitTargetId;

For performance reasons, we've partially denormalized by copying the most common filtering columns (e.g. SaleDate) from the master table into each detail row, and then we added covering indexes on both tables to better support date-filtered queries. This works great to reduce I/O when running date-filtered queries, but I think this approach is causing cardinality estimation problems when joining the master and detail tables together.
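For reference, a quick way to verify that invariant (this check query is illustrative, not part of our schema or workload) is something like the following, which should report zero mismatches:

-- Illustrative sanity check: every detail row should have a master row,
-- and the denormalized SaleDate/StoreId copies should match it exactly.
SELECT COUNT(*) AS MismatchedRows
FROM VisitSale vs
    LEFT JOIN VisitTarget vt on vt.VisitTargetId = vs.VisitTargetId
WHERE vt.VisitTargetId IS NULL       -- orphaned detail row (blocked by the FK anyway)
   OR vt.SaleDate <> vs.SaleDate     -- denormalized date out of sync
   OR vt.StoreId <> vs.StoreId;      -- denormalized store out of sync
-- Expected result: 0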

When we join these two tables, queries look like this:

SELECT vt.StoreId, vt.SomeOtherColumn, Sales = sum(vs.SalePrice*vs.SaleQty)
FROM VisitTarget vt 
    JOIN VisitSale vs on vt.VisitTargetId = vs.VisitTargetId
WHERE
    vs.SaleDate BETWEEN '20170101' and '20171231'
    and vt.SaleDate BETWEEN '20170101' and '20171231'
    -- more filtering goes here, e.g. by store, by product, etc. 
GROUP BY vt.StoreId, vt.SomeOtherColumn

The date filter on the detail table (VisitSale) is logically redundant. It's there to enable sequential I/O (i.e. an Index Seek) against the detail table's covering index for queries that are filtered by a date range.
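For comparison, one simple experiment (not shown above) is to run the same aggregate without the redundant detail-side predicate and see whether the join estimate improves; if it does, the duplicated SaleDate filter is at least contributing to the problem:

-- Experiment: same query, but only the master-side date filter.
-- Compare the join operator's estimated vs. actual rows against the original plan.
SELECT vt.StoreId, vt.SomeOtherColumn, Sales = sum(vs.SalePrice*vs.SaleQty)
FROM VisitTarget vt 
    JOIN VisitSale vs on vt.VisitTargetId = vs.VisitTargetId
WHERE vt.SaleDate BETWEEN '20170101' and '20171231'
GROUP BY vt.StoreId, vt.SomeOtherColumn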

The plan for these kinds of queries looks like this:

[Screenshot of the actual execution plan, showing the hash join between the two tables]

An actual plan of a query with the same problem can be found here.

As you can see, the cardinality estimation for the join (the tooltip in the lower-left in the picture) is over 4x too low: 2.1M actual vs. 0.5M estimated. This causes performance issues (e.g. spilling to tempdb), especially when this query is a subquery that's used in a more complex query.

But the row-count estimates for each branch of the join are close to the actual row counts. The top half of the join is 100K actual vs. 164K estimated. The bottom half of the join is 2.1M rows actual vs. 3.7M estimated. Hash bucket distribution also looks good. These observations suggest to me that statistics are OK for each table, and that the problem is the estimation of the join cardinality.
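For the record, the single-table statistics can be double-checked directly; a minimal sketch, assuming the statistics of interest are the ones created with the covering indexes defined above:

-- Inspect the SaleDate histograms that drive the single-table estimates.
DBCC SHOW_STATISTICS ('VisitTarget', 'IX_VisitTarget_SaleDate') WITH HISTOGRAM;
DBCC SHOW_STATISTICS ('VisitSale', 'IX_VisitSale_SaleDate') WITH HISTOGRAM;

-- If they look stale, refresh them and re-check the plan.
UPDATE STATISTICS VisitTarget IX_VisitTarget_SaleDate WITH FULLSCAN;
UPDATE STATISTICS VisitSale IX_VisitSale_SaleDate WITH FULLSCAN;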

At first I thought the problem was that SQL Server expects the SaleDate columns in the two tables to be independent, whereas really they are identical. So I tried adding an equality comparison on the sale dates to the join condition or the WHERE clause, e.g.

ON vt.VisitTargetId = vs.VisitTargetId and vt.SaleDate = vs.SaleDate

or

WHERE vt.SaleDate = vs.SaleDate

This didn't work. It even made the cardinality estimates worse! So either SQL Server isn't exploiting that equality predicate, or something else is the root cause of the problem.

Got any ideas for how to troubleshoot and hopefully fix this cardinality estimation issue? My goal is for the cardinality of the master/detail join to be estimated the same as the estimate for the larger ("detail table") input of the join.

If it matters, we're running SQL Server 2014 Enterprise SP2 CU8 build 12.0.5557.0 on Windows Server. There are no trace flags enabled. Database compatibility level is SQL Server 2014. We see the same behavior on multiple different SQL Servers, so it seems unlikely to be a server-specific problem.
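One low-risk experiment (not something we've tried yet) is to compare the estimate under the legacy cardinality estimator for just this statement; on SQL Server 2014 that can be requested per query with a trace flag (which requires sysadmin rights, or a plan guide):

-- Compile this one statement under the legacy (pre-2014) CE model and
-- compare the join estimate against the new-CE plan.
SELECT vt.StoreId, vt.SomeOtherColumn, Sales = sum(vs.SalePrice*vs.SaleQty)
FROM VisitTarget vt 
    JOIN VisitSale vs on vt.VisitTargetId = vs.VisitTargetId
WHERE
    vs.SaleDate BETWEEN '20170101' and '20171231'
    and vt.SaleDate BETWEEN '20170101' and '20171231'
GROUP BY vt.StoreId, vt.SomeOtherColumn
OPTION (QUERYTRACEON 9481); -- 9481 = legacy CE for this query only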

There's an optimization in the SQL Server 2014 Cardinality Estimator that provides exactly the behavior I'm looking for:

The new CE, however, uses a simpler algorithm that assumes that there is a one-to-many join association between a large table and a small table. This assumes that each row in the large table matches exactly one row in the small table. This algorithm returns the estimated size of the larger input as the join cardinality.

Ideally I could get this behavior, where the cardinality estimate for the join would be the same as the estimate for the large table, even though my "small" table will still return over 100K rows!
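It's also worth confirming which CE model the plan is actually compiled under. The showplan root exposes a CardinalityEstimationModelVersion attribute (120 = new CE, 70 = legacy); it's visible in the graphical plan's properties, or it can be pulled from the plan cache with something like the query below (the LIKE filter is just an illustrative way to find the statement):

-- Which CE model was used for cached plans touching VisitSale?
SELECT qp.query_plan.value(
         'declare namespace p="http://schemas.microsoft.com/sqlserver/2004/07/showplan";
          (//p:StmtSimple/@CardinalityEstimationModelVersion)[1]', 'varchar(10)') AS ce_model,
       st.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
WHERE st.text LIKE '%VisitSale%';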

Best Answer

Assuming that no improvement can be gained by doing something to statistics or by using the legacy CE, the most straightforward way around your problem is to change your INNER JOIN to a LEFT OUTER JOIN:

SELECT vt.StoreId, vt.SomeOtherColumn, Sales = sum(vs.SalePrice*vs.SaleQty)
FROM VisitSale vs
    LEFT OUTER JOIN VisitTarget vt on vt.VisitTargetId = vs.VisitTargetId
            AND vt.SaleDate BETWEEN '20170101' and '20171231'
WHERE vs.SaleDate BETWEEN '20170101' and '20171231'
GROUP BY vt.StoreId, vt.SomeOtherColumn

If you have a foreign key between the tables, you always filter on the same SaleDate range for both tables, and SaleDate always matches between the tables, then the results of your query should not change. It may seem odd to use an outer join like this, but think of it as informing the query optimizer that the join to the VisitTarget table will never reduce the number of rows returned by the query. Unfortunately, foreign keys do not change cardinality estimates, so sometimes you need to resort to tricks like this. (Microsoft Connect suggestion: Make optimizer estimations more accurate by using metadata).

It's possible that writing the query in this form won't work well, depending on what else happens in the query after the join. You could also try using a temp table to hold the intermediate result set whose cardinality estimate matters most.
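A rough sketch of that temp-table variant (the #FilteredSale name and column list are illustrative): materializing the date-filtered detail rows first gives the optimizer an exact row count, plus fresh statistics, for the join that follows.

-- Materialize the date-filtered detail rows; their cardinality is then
-- known exactly when the join query below is compiled.
SELECT vs.VisitTargetId, vs.StoreId, vs.ItemId, vs.SaleQty, vs.SalePrice
INTO #FilteredSale
FROM VisitSale vs
WHERE vs.SaleDate BETWEEN '20170101' and '20171231';

SELECT vt.StoreId, vt.SomeOtherColumn, Sales = sum(fs.SalePrice*fs.SaleQty)
FROM #FilteredSale fs
    LEFT OUTER JOIN VisitTarget vt on vt.VisitTargetId = fs.VisitTargetId
GROUP BY vt.StoreId, vt.SomeOtherColumn;

DROP TABLE #FilteredSale;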