Sql-server – How does SQL Server know predicates are correlated

performancequery-performancesql serversql-server-2008-r2statistics

While diagnosing SQL Server 2008 R2 queries with poor cardinality estimation (despite simple indexing, up-to-date statistics, etc.) and hence poor query plans, I found a perhaps-related KB article:
FIX: Poor performance when you run a query that contains correlated AND predicates in SQL Server 2008 or in SQL Server 2008 R2 or in SQL Server 2012

I can guess what the KB article means by "correlated", e.g. predicate #2 and predicate #1 largely target the same rows.

But I don't know how SQL Server knows about these correlations. Does a table need a multi-column index containing columns from both predicates? Does SQL use statistics to check whether values from one column are correlated to another? Or is some other method used?

I'm asking this for two reasons:

  1. to determine which of my tables and queries might be improved using this hotfix
  2. to know what I should do in indexing, statistics, etc. to affect #1

Best Answer

Consider the simple AdventureWorks query and execution plan shown below. The query contains predicates connected with AND. The optimizer's cardinality estimate is 41,211 rows:

-- Estimate 41,211 rows
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionID BETWEEN 100000 AND 168336
    AND TH.TransactionDate BETWEEN '2007-09-01' AND '2008-03-13';

Default execution plan

Using default statistics

Given only single-column statistics the optimizer produces this estimate by estimating the cardinality for each predicate separately, and multiplying the resulting selectivities together. This heuristic assumes that the predicates are completely independent.

Splitting the query into two parts makes the calculation easier to see:

-- Estimate 68,336.4 rows
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionID BETWEEN 100000 AND 168336;

The Transaction History table contains 113,443 rows in total, so the 68,336.4 estimate represents a selectivity of 68336.4 / 113443 = 0.60238533 for this predicate. This estimate is obtained using the histogram information for the TransactionID column, and the constant values specified in the query.

-- Estimate 68,413 rows
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionDate BETWEEN '2007-09-01' AND '2008-03-13';

This predicate has an estimated selectivity of 68413.0 / 113443 = 0.60306056. Again, it is calculated from the predicate's constant values and the histogram of the TransactionDate statistics object.

Assuming the predicates are completely independent, we can estimate the selectivity of the two predicates together by multiplying them together. The final cardinality estimate is obtained by multiplying the resulting selectivity by the 113,443 rows in the base table:

0.60238533 * 0.60306056 * 113443 = 41210.987

After rounding, this is the 41,211 estimate seen in the original query (the optimizer also uses floating point math internally).

Not a great estimate

The TransactionID and TransactionDate columns have a close correlation in the AdventureWorks data set (as monotonically increasing keys and date columns often do). This correlation means that the independence assumption is violated. As a consequence, the post-execution query plan shows 68,095 rows rather than the estimated 41,211:

Post-execution plan

Trace flag 4137

Enabling this trace flag changes the heuristics used to combine predicates. Instead of assuming complete independence, the optimizer considers that the selectivities of the two predicates are close enough that they are likely to be correlated:

-- Estimate 68,336.4
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionID BETWEEN 100000 AND 168336
    AND TH.TransactionDate BETWEEN '2007-09-01' AND '2008-03-13'
OPTION (QUERYTRACEON 4137);

Recall that the TransactionID predicate alone estimated 68,336.4 rows and the TransactionDate predicate alone estimated 68,413 rows. The optimizer has chosen the lower of these two estimates rather than multiplying selectivities.

This is just a different heuristic, of course, but one that can help improve estimates for queries with correlated AND predicates. Each predicate is considered for possible correlation, and there are other adjustments made when many AND clauses are involved, but that example serves to show the basics of it.

Multi-column statistics

These can help in queries with correlations, but the histogram information is still based solely on the leading column of the statistics. The following candidate multi-column statistics therefore differ in an important way:

CREATE STATISTICS
    [stats Production.TransactionHistory TransactionID TransactionDate]
ON Production.TransactionHistory
    (TransactionID, TransactionDate);

CREATE STATISTICS
    [stats Production.TransactionHistory TransactionDate TransactionID]
ON Production.TransactionHistory
    (TransactionDate, TransactionID);

Taking just one of those, we can see that the only extra information is the extra levels of the 'all' density. The histogram still only contains detailed information about the TransactionDate column.

DBCC SHOW_STATISTICS
    (
        'Production.TransactionHistory', 
        'stats Production.TransactionHistory TransactionDate TransactionID'
    );

Multi-column statstics

With these multi-column statistics in place...

SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionID BETWEEN 100000 AND 168336
    AND TH.TransactionDate BETWEEN '2007-09-01' AND '2008-03-13';

...the execution plan shows an estimate that is exactly the same as when only single-column statistics were available:

Multi-column statistics plan