Sql-server – Сardinality estimation of partially covering range predicates

cardinality-estimatesoptimizationquerysql serversql server 2014

At the moment I'm trying to figure out how SQL Server evaluates the cardinality of range predicates that partially cover the histogram step.

On the Internet, at cardinality-estimation-for-and-for-intra-step-statistics-value I came across a similar question and Paul White gave a rather interesting answer to it.

According to Paul's answer, the formulas for estimating cardinality for the predicates >= and > (in this case, I am only interested in the Cardinality estimator model of at least 120) are as follows:

For >:

Cardinality = EQ_ROWS + (AVG_RANGE_ROWS * (F * (DISTINCT_RANGE_ROWS - 1)))

For >=:

Cardinality = EQ_ROWS + (AVG_RANGE_ROWS * ((F * (DISTINCT_RANGE_ROWS - 1)) + 1))

I tested the application of these formulas on the [Production].[TransactionHistory] table of the AdventureWorks2014 database based on the range predicate using the TransactionDate column and datetime range between '20140614' and '20140618'.

The statistics for the histogram step of this range are as follows:

Histogram

According to the formula, I calculated the cardinality for the following query:

SELECT COUNT(1)
FROM [AdventureWorks2014].[Production].[TransactionHistory]
WHERE [TransactionDate] BETWEEN '20140615 00:00:00.000' AND '20140616 00:00:00.000'

The calculation was performed using the following code:

  DECLARE @predStart DATETIME =  '20140615 00:00:00.000'
  DECLARE @predEnd DATETIME = '20140616 00:00:00.000'

  DECLARE @stepStart DATETIME = '20140614 00:00:00.000'
  DECLARE @stepEnd DATETIME = '20140618 00:00:00.000'

  DECLARE @predRange FLOAT = DATEDIFF(ms, @predStart, @predEnd)
  DECLARE @stepRange FLOAT = DATEDIFF(ms, @stepStart, @stepEnd)

  DECLARE @F FLOAT = @predRange / @stepRange;

  DECLARE @avg_range_rows FLOAT = 100.3333
  DECLARE @distinct_range_rows INT = 3
  DECLARE @EQ_ROWS INT = 0

  SELECT @F AS 'F'

  --for new cardinality estimator

  SELECT @EQ_ROWS + @avg_range_rows * (@F * (@distinct_range_rows - 1) + 1) AS [new_card]

After calculating, I got the following results:

enter image description here

According to the formula, it turned out 150.5, but the optimizer estimates the predicate at 225.75 rows, and if you change the upper border of the predicate to ‘20140617’, the optimizer will already evaluate 250.833 rows, while using the formula we get only 200.6666 rows.

Please tell me, how does Cardinality Estimator evaluate in this case, maybe I made a mistake somewhere in my understanding of the quoted formulas?

Best Answer

SQL Server uses different calculations in different situations. Your example is different from the linked Q & A because your range is entirely contained within a step; it does not cross a step boundary. It is also an interval with two ends rather than one. Writing BETWEEN is the same as writing two separate predicates with >= and <=.

Interval with two boundaries, within a single step

The formula is modified to perform linear interpolation within the step for the number of distinct values expected, and reflect that two range endpoints are now specified (and assumed to exist within the histogram step) rather than one.

Using the histogram steps given in the question:

question histogram steps

For the query with BETWEEN '20140615' AND '20140616', the calculation is:

DECLARE
    @Q1 float = CONVERT(float, CONVERT(datetime, '2014-06-15')),
    @Q2 float = CONVERT(float, CONVERT(datetime, '2014-06-16')),
    @K1 float = CONVERT(float, CONVERT(datetime, '2014-06-14')),
    @K2 float = CONVERT(float, CONVERT(datetime, '2014-06-18')),
    @RANGE_ROWS float = 301,
    @DISTINCT_RANGE_ROWS float = 3;

DECLARE
    @S1 float = (@Q1 - @K1) / (@K2 - @K1),
    @S2 float = (@Q2 - @K1) / (@K2 - @K1);

DECLARE
    @F float = @S2 - @S1;

DECLARE
    @AVG_RANGE_ROWS float = @RANGE_ROWS / @DISTINCT_RANGE_ROWS;

SELECT
    @AVG_RANGE_ROWS * ((@F * (@DISTINCT_RANGE_ROWS - 2)) + 2);

...giving 225.75. Changing @Q2 from '20140616' to '20140617' gives a result of 250.833.

Both results match those given in the question.