While diagnosing SQL Server 2008 R2 queries with poor cardinality estimation (despite simple indexing, up-to-date statistics, etc.) and hence poor query plans, I found a perhaps-related KB article:
FIX: Poor performance when you run a query that contains correlated AND predicates in SQL Server 2008 or in SQL Server 2008 R2 or in SQL Server 2012
I can guess what the KB article means by "correlated", e.g. predicate #2 and predicate #1 largely target the same rows.
But I don't know how SQL Server knows about these correlations. Does a table need a multi-column index containing columns from both predicates? Does SQL use statistics to check whether values from one column are correlated to another? Or is some other method used?
I'm asking this for two reasons:
- to determine which of my tables and queries might be improved using this hotfix
- to know what I should do in indexing, statistics, etc. to affect #1
Best Answer
Consider the simple AdventureWorks query and execution plan shown below. The query contains predicates connected with
AND
. The optimizer's cardinality estimate is 41,211 rows:Using default statistics
Given only single-column statistics the optimizer produces this estimate by estimating the cardinality for each predicate separately, and multiplying the resulting selectivities together. This heuristic assumes that the predicates are completely independent.
Splitting the query into two parts makes the calculation easier to see:
The Transaction History table contains 113,443 rows in total, so the 68,336.4 estimate represents a selectivity of 68336.4 / 113443 = 0.60238533 for this predicate. This estimate is obtained using the histogram information for the
TransactionID
column, and the constant values specified in the query.This predicate has an estimated selectivity of 68413.0 / 113443 = 0.60306056. Again, it is calculated from the predicate's constant values and the histogram of the
TransactionDate
statistics object.Assuming the predicates are completely independent, we can estimate the selectivity of the two predicates together by multiplying them together. The final cardinality estimate is obtained by multiplying the resulting selectivity by the 113,443 rows in the base table:
After rounding, this is the 41,211 estimate seen in the original query (the optimizer also uses floating point math internally).
Not a great estimate
The
TransactionID
andTransactionDate
columns have a close correlation in the AdventureWorks data set (as monotonically increasing keys and date columns often do). This correlation means that the independence assumption is violated. As a consequence, the post-execution query plan shows 68,095 rows rather than the estimated 41,211:Trace flag 4137
Enabling this trace flag changes the heuristics used to combine predicates. Instead of assuming complete independence, the optimizer considers that the selectivities of the two predicates are close enough that they are likely to be correlated:
Recall that the
TransactionID
predicate alone estimated 68,336.4 rows and theTransactionDate
predicate alone estimated 68,413 rows. The optimizer has chosen the lower of these two estimates rather than multiplying selectivities.This is just a different heuristic, of course, but one that can help improve estimates for queries with correlated
AND
predicates. Each predicate is considered for possible correlation, and there are other adjustments made when manyAND
clauses are involved, but that example serves to show the basics of it.Multi-column statistics
These can help in queries with correlations, but the histogram information is still based solely on the leading column of the statistics. The following candidate multi-column statistics therefore differ in an important way:
Taking just one of those, we can see that the only extra information is the extra levels of the 'all' density. The histogram still only contains detailed information about the
TransactionDate
column.With these multi-column statistics in place...
...the execution plan shows an estimate that is exactly the same as when only single-column statistics were available: