Sql-server – How does SQL Server know predicates are correlated

performance, query-performance, sql server, sql-server-2008-r2, statistics

While diagnosing SQL Server 2008 R2 queries with poor cardinality estimation (despite simple indexing, up-to-date statistics, etc.) and hence poor query plans, I found a perhaps-related KB article:
FIX: Poor performance when you run a query that contains correlated AND predicates in SQL Server 2008 or in SQL Server 2008 R2 or in SQL Server 2012

I can guess what the KB article means by "correlated", e.g. predicate #2 and predicate #1 largely target the same rows.

But I don't know how SQL Server knows about these correlations. Does a table need a multi-column index containing columns from both predicates? Does SQL Server use statistics to check whether values in one column are correlated with values in another? Or is some other method used?

I'm asking this for two reasons:

1. To determine which of my tables and queries might be improved using this hotfix
2. To know what I should do in indexing, statistics, etc. to affect #1

Best Answer

Consider the simple AdventureWorks query and execution plan shown below. The query contains predicates connected with AND. The optimizer's cardinality estimate is 41,211 rows:

-- Estimate 41,211 rows
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionID BETWEEN 100000 AND 168336
    AND TH.TransactionDate BETWEEN '2007-09-01' AND '2008-03-13';

[Figure: Default execution plan]

Using default statistics

Given only single-column statistics, the optimizer produces this estimate by estimating the cardinality of each predicate separately and multiplying the resulting selectivities together. This heuristic assumes that the predicates are completely independent.

Splitting the query into two parts makes the calculation easier to see:

-- Estimate 68,336.4 rows
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionID BETWEEN 100000 AND 168336;

The TransactionHistory table contains 113,443 rows in total, so the 68,336.4 estimate represents a selectivity of 68336.4 / 113443 = 0.60238533 for this predicate. This estimate is obtained using the histogram information for the TransactionID column, and the constant values specified in the query.

-- Estimate 68,413 rows
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionDate BETWEEN '2007-09-01' AND '2008-03-13';

This predicate has an estimated selectivity of 68413.0 / 113443 = 0.60306056. Again, it is calculated from the predicate's constant values and the histogram of the TransactionDate statistics object.
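To see which statistics objects the optimizer has available for these two columns, a catalog query along the following lines can help. This is a minimal sketch that assumes the AdventureWorks sample database is the current database; the statistics names it returns depend on your copy of the database.

```sql
-- List every statistics object on the table, with its columns in key order.
-- The first column of each object (stats_column_id = 1) is the one the
-- histogram describes.
SELECT
    s.name             AS StatsName,
    s.auto_created     AS IsAutoCreated,
    s.user_created     AS IsUserCreated,
    sc.stats_column_id AS ColumnPosition,
    c.name             AS ColumnName
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON sc.object_id = s.object_id
    AND sc.stats_id = s.stats_id
JOIN sys.columns AS c
    ON c.object_id = sc.object_id
    AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID(N'Production.TransactionHistory')
ORDER BY s.name, sc.stats_column_id;
```

Any StatsName returned here can be passed to DBCC SHOW_STATISTICS to inspect the histogram behind the single-predicate estimates.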

Assuming the predicates are completely independent, we can estimate the combined selectivity of the two predicates by multiplying their individual selectivities. The final cardinality estimate is obtained by multiplying the resulting selectivity by the 113,443 rows in the base table:

0.60238533 * 0.60306056 * 113443 = 41210.987

After rounding, this is the 41,211 estimate seen in the original query (the optimizer also uses floating point math internally).

Not a great estimate

The TransactionID and TransactionDate columns have a close correlation in the AdventureWorks data set (as monotonically increasing keys and date columns often do). This correlation means that the independence assumption is violated. As a consequence, the post-execution query plan shows 68,095 rows rather than the estimated 41,211:

[Figure: Post-execution plan]
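One way to check how far the independence assumption is off for a pair of predicates is to count the rows matching each predicate alone and both together. The sketch below reuses the predicates from the example query; the column aliases are just illustrative names.

```sql
-- Actual row counts for each predicate alone and for both combined.
-- If the predicates were independent, MatchBoth would be close to
-- MatchID * MatchDate / TotalRows. A much larger MatchBoth suggests
-- the predicates are correlated.
SELECT
    COUNT_BIG(*) AS TotalRows,
    SUM(CASE WHEN TH.TransactionID BETWEEN 100000 AND 168336
             THEN 1 ELSE 0 END) AS MatchID,
    SUM(CASE WHEN TH.TransactionDate BETWEEN '2007-09-01' AND '2008-03-13'
             THEN 1 ELSE 0 END) AS MatchDate,
    SUM(CASE WHEN TH.TransactionID BETWEEN 100000 AND 168336
              AND TH.TransactionDate BETWEEN '2007-09-01' AND '2008-03-13'
             THEN 1 ELSE 0 END) AS MatchBoth
FROM Production.TransactionHistory AS TH;
```

This scans the whole table, so on large tables it is better suited to a test copy than to production.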

Trace flag 4137

Enabling this trace flag changes the heuristics used to combine predicates. Instead of assuming complete independence, the optimizer considers that the selectivities of the two predicates are close enough that they are likely to be correlated:

-- Estimate 68,336.4
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionID BETWEEN 100000 AND 168336
    AND TH.TransactionDate BETWEEN '2007-09-01' AND '2008-03-13'
OPTION (QUERYTRACEON 4137);

Recall that the TransactionID predicate alone estimated 68,336.4 rows and the TransactionDate predicate alone estimated 68,413 rows. The optimizer has chosen the lower of these two estimates rather than multiplying selectivities.

This is just a different heuristic, of course, but one that can help improve estimates for queries with correlated AND predicates. Each predicate is considered for possible correlation, and there are other adjustments made when many AND clauses are involved, but this example serves to show the basics of it.
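The two heuristics can be compared side by side with a little arithmetic. The sketch below is only an illustration of the two-predicate case described above: the inputs are the estimates quoted in the text rather than values read from the server, and it does not model the further adjustments made when many AND clauses are involved.

```sql
-- Inputs copied from the estimates quoted in the text.
DECLARE @TableRows float = 113443;                -- rows in the base table
DECLARE @SelID     float = 68336.4 / @TableRows;  -- TransactionID predicate
DECLARE @SelDate   float = 68413.0 / @TableRows;  -- TransactionDate predicate

SELECT
    -- Default behaviour: assume independence and multiply selectivities.
    @SelID * @SelDate * @TableRows AS IndependenceEstimate,
    -- Two-predicate case under trace flag 4137 as described above:
    -- keep the lower of the two estimates.
    (CASE WHEN @SelID < @SelDate THEN @SelID ELSE @SelDate END)
        * @TableRows AS LowestSelectivityEstimate;
```

The first column should come out at about 41,211 and the second at 68,336.4, matching the two estimates shown in the text.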

Multi-column statistics

These can help in queries with correlations, but the histogram information is still based solely on the leading column of the statistics. The following candidate multi-column statistics therefore differ in an important way:

CREATE STATISTICS
    [stats Production.TransactionHistory TransactionID TransactionDate]
ON Production.TransactionHistory
    (TransactionID, TransactionDate);

CREATE STATISTICS
    [stats Production.TransactionHistory TransactionDate TransactionID]
ON Production.TransactionHistory
    (TransactionDate, TransactionID);

Taking just one of those, we can see that the only extra information is the extra levels of the 'all' density. The histogram still only contains detailed information about the TransactionDate column.

DBCC SHOW_STATISTICS
    (
        'Production.TransactionHistory', 
        'stats Production.TransactionHistory TransactionDate TransactionID'
    );

[Figure: Multi-column statistics]
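To look at the density and histogram sections separately, DBCC SHOW_STATISTICS accepts WITH options that limit the output to one section. A short sketch, assuming the statistics object created above exists:

```sql
-- 'All density' values: one row per leading-column prefix,
-- (TransactionDate) and (TransactionDate, TransactionID).
DBCC SHOW_STATISTICS
    (
        'Production.TransactionHistory',
        'stats Production.TransactionHistory TransactionDate TransactionID'
    )
WITH DENSITY_VECTOR;

-- Histogram: steps describe the leading column (TransactionDate) only.
DBCC SHOW_STATISTICS
    (
        'Production.TransactionHistory',
        'stats Production.TransactionHistory TransactionDate TransactionID'
    )
WITH HISTOGRAM;
```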

With these multi-column statistics in place...

SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionID BETWEEN 100000 AND 168336
    AND TH.TransactionDate BETWEEN '2007-09-01' AND '2008-03-13';

...the execution plan shows an estimate that is exactly the same as when only single-column statistics were available:

[Figure: Multi-column statistics plan]

Related Solutions

Sql-server – SQL Server Linked Servers and Remote Statistics

What I do is this 'trick':

Give this 'linkuser' read access, whatever else it needs, and membership of the db_ddladmin role, then run the following script to ensure that none of the db_ddladmin privileges can actually be used.

Because the user is technically a member of db_ddladmin, it can now see statistics in the remote database. (Note: from SQL Server 2012 SP1 there is no need for this hack; the statistics are visible as they should be.)

DENY CREATE TABLE TO [LinkUser]
DENY CREATE VIEW TO [LinkUser]
DENY CREATE PROCEDURE TO [LinkUser]
DENY CREATE FUNCTION TO [LinkUser]
DENY CREATE RULE TO [LinkUser]
DENY CREATE DEFAULT TO [LinkUser]
DENY CREATE TYPE TO [LinkUser]
DENY CREATE ASSEMBLY TO [LinkUser]
DENY CREATE XML SCHEMA COLLECTION TO [LinkUser]
DENY CREATE SCHEMA TO [LinkUser]
DENY CREATE SYNONYM TO [LinkUser]
DENY CREATE AGGREGATE TO [LinkUser]
DENY CREATE MESSAGE TYPE TO [LinkUser]
DENY CREATE SERVICE TO [LinkUser]
DENY CREATE CONTRACT TO [LinkUser]
DENY CREATE REMOTE SERVICE BINDING TO [LinkUser]
DENY CREATE ROUTE TO [LinkUser]
DENY CREATE QUEUE TO [LinkUser]
DENY CREATE SYMMETRIC KEY TO [LinkUser]
DENY CREATE ASYMMETRIC KEY TO [LinkUser]
DENY CREATE FULLTEXT CATALOG TO [LinkUser]
DENY CREATE CERTIFICATE TO [LinkUser]
DENY CREATE DATABASE DDL EVENT NOTIFICATION TO [LinkUser]
DENY CHECKPOINT TO [LinkUser]
DENY ALTER ANY SCHEMA TO [LinkUser]
DENY ALTER ANY ASSEMBLY TO [LinkUser]
DENY ALTER ANY DATASPACE TO [LinkUser]
DENY ALTER ANY MESSAGE TYPE TO [LinkUser]
DENY ALTER ANY CONTRACT TO [LinkUser]
DENY ALTER ANY SERVICE TO [LinkUser]
DENY ALTER ANY REMOTE SERVICE BINDING TO [LinkUser]
DENY ALTER ANY ROUTE TO [LinkUser]
DENY ALTER ANY FULLTEXT CATALOG TO [LinkUser]
DENY ALTER ANY SYMMETRIC KEY TO [LinkUser]
DENY ALTER ANY ASYMMETRIC KEY TO [LinkUser]
DENY ALTER ANY CERTIFICATE TO [LinkUser]
DENY REFERENCES TO [LinkUser]
DENY ALTER ANY DATABASE DDL TRIGGER TO [LinkUser]
DENY ALTER ANY DATABASE EVENT NOTIFICATION TO [LinkUser]
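After running the script, it may be worth confirming what the user can still do. The following is a minimal sketch, assuming a database user named LinkUser exists in the remote database and that you have permission to impersonate it.

```sql
-- Impersonate the link user in the current database.
EXECUTE AS USER = N'LinkUser';

-- Role membership the statistics visibility depends on (1 = member).
SELECT IS_MEMBER(N'db_ddladmin') AS IsDdlAdminMember;

-- Effective database-level permissions; compare this list against
-- the DENY statements above.
SELECT permission_name
FROM sys.fn_my_permissions(NULL, N'DATABASE')
ORDER BY permission_name;

-- Switch back to the original security context.
REVERT;
```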

Sql-server – Aliased Column name in SQL Server

As @JNK mentioned, you can use a view. There are some steps to take, of course. The view should map only to columns in a single table.

I assume that you do not want to deploy completely updated code that uses the new name. So, you can create a view with a different (but similar) name and any new code could use the view, and the new name. Existing code would still be going directly to the table.
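As an illustration, such a view might look like the sketch below. All object names here are hypothetical: dbo.Customer stands for the existing table, CustName for the old column name, and CustomerName for the new name.

```sql
-- Hypothetical names throughout; substitute your own table and columns.
-- The view maps to a single table and exposes the old column under
-- its new name.
CREATE VIEW dbo.CustomerV
AS
SELECT
    C.CustomerID,
    C.CustName AS CustomerName  -- old column, new name
FROM dbo.Customer AS C;
```

New code would then query dbo.CustomerV and use CustomerName, while existing code keeps reading dbo.Customer directly.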

This would get you moving, but one day you will probably want everything to use the new column name. Eventually, you will have to deploy new code, which would still mean a number of name changes if you don't want to keep the view layer over the table.