Sql-server – Why does SQL Server not do compound column statistics histograms

sql serverstatistics

SQL Server has a thing called "multi-column statistics", but it's not what one think it would mean.

Let's take a look at the following sample table:

CREATE TABLE BadStatistics 
(
    IsArchived BIT NOT NULL,
    Id INT NOT NULL IDENTITY PRIMARY KEY,
    Mystery VARCHAR(200) NOT NULL
);

CREATE NONCLUSTERED INDEX BadIndex 
    ON BadStatistics (IsArchived, Mystery);

With that, two statistics are being created on the two indexes we have:

Stats for BadIndex:

+--------------+----------------+-------------------------+
| All density  | Average Length | Columns                 |
+--------------+----------------+-------------------------+
| 0.5          | 1              | IsArchived              |
+--------------+----------------+-------------------------+
| 4.149378E-06 | 37             | IsArchived, Mystery     |
+--------------+----------------+-------------------------+
| 4.149378E-06 | 41             | IsArchived, Mystery, Id |
+--------------+----------------+-------------------------+

+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
| 0            | 0          | 24398   | 0                   | 1              |
+--------------+------------+---------+---------------------+----------------+
| 1            | 0          | 216602  | 0                   | 1              |
+--------------+------------+---------+---------------------+----------------+

Stats for clustered index:

+--------------+----------------+---------+
| All density  | Average Length | Columns |
+--------------+----------------+---------+
| 4.149378E-06 | 4              | Id      |
+--------------+----------------+---------+

+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
| 1            | 0          | 1       | 0                   | 1              |
+--------------+------------+---------+---------------------+----------------+
| 240999       | 240997     | 1       | 240997              | 1              |
+--------------+------------+---------+---------------------+----------------+
| 241000       | 0          | 1       | 0                   | 1              |
+--------------+------------+---------+---------------------+----------------+

(I've populated the table with random sample data where about a tenth of the rows are non-archived. I ran a full scan statistics update afterwards.)

Why does the histogram of the two-column stats only use one column? I know that numerous people have written about that it does, but what's the rationale? In this case, it makes the whole histogram a lot less useful, because the first column only ever has two values. Why would stats be arbitrarily restricted like that?

Please note that this question isn't referring to multi-dimensional histograms, which are an entirely different beast. It's about single-dimensional histograms with the single dimension being the tuples containing the respective, multiple columns.

Best Answer

Background

The current SQL Server model only uses single-column histograms and multi-column density information. Single column histograms are used to estimate selectivity for suitable predicates e.g. a = 1 or b > 50. A query with multiple predicates simply combines the individual selectivities (with assumptions) to produce an estimated overall selectivity.

For an example, see my article Cardinality Estimation: Combining Density Statistics

Multi-column density further informs the model by providing weak correlation information for multiple equality predicates and grouping cardinalities for aggregations.

Statistics associated with indexes are an opportunistic add-on to that model: The engine might as well collect (normally full scan) statistics while it is building an index. SQL Server automatically constructs a leading-column histogram and density information for the other keys.

Histograms for non-leading columns in an index may be constructed on demand automatically by the query processor, or in advance using sp_createstats with the @indexonly option (among others).

Multi-column histograms

The assumptions made when combining single-column statistics (as above) may or may not model the reality of the data well enough. In many cases, the options available (exponential backoff, independence, minimum selectivity) produce a 'good enough' estimate.

We also have filtered statistics (and indexes) as a natural solution to low-cardinality leading column indexes such as in the question example. Taking these to the logical extreme leads us closer to the multi-dimensional statistics that the question is not about.

When the available modelling options cannot provide a suitable estimate, a multi-column statistic histogram could indeed give a better selectivity estimate for suitable index predicates, in some cases. There are some difficulties around combining different data types in different columns, but nothing insurmountable.

We would also need a histogram for each level of the index keys (for best results); so for an index on (a, b, c) that would mean histograms on (a, b) and (a, b, c) in addition to the current single-column histogram on (a) alone.

The mechanism used to detect stale statistics would also need to be modified to maintain affected multi-column histograms. These histograms would likely end up being rebuilt more often than single-column statistics, simply because modifications to more columns affect them.

All this adds size, complexity, and maintenance overhead.

Multi-column statistics can be simulated (to a limited extent) using a statistic created on a carefully constructed computed column referencing multiple columns. The query would need to include a predicate on the computed column (or an exact textual match for the underlying formula) to take advantage of that statistic. There are probably only very limited situations where this approach is practical. Nevertheless, it has some of the same implementation issues as automatic multi-column histograms would.

Ultimately, the only people that could say for sure why SQL Server does not support multi-column statistics would be the designers themselves. If you feel you can make a strong case for a product improvement in this area with wide applicability, you could suggest it on Connect or via your normal support channel.

Footnote

In this case, it makes the whole histogram a lot less useful, because the first column only ever has two values

The histogram still provides useful information about the distribution of values in the leading column: When the statistics were built, there were 24,398 rows where IsArchived was false, and 216,602 rows where is was true.

In addition, the statistics object tells us there are (1 / 0.5) = 2 distinct values for IsArchived, (1 / 4.149378E-06) ~= 241000 distinct values for (IsArchived, Mystery) with an average row size of 37 bytes, and there same frequency for (IsArchived, Mystery, Id) with 4 extra bytes per row.

That's all good general-purpose information, which can be combined with statistical information about other columns to produce a selectivity estimate in queries with multiple predicates (as mentioned).