SQL Server – Why does SQL Server estimate fewer rows will be emitted from a join after inserting some rows?

cardinality-estimates, sql-server, sql-server-2014

The below is a simplified version of something I came across in production (where the plan got catastrophically worse on a day where an unusually high number of batches were processed).

The repro has been tested against 2014 and 2016 with the new cardinality estimator.

CREATE TABLE T1 (FromDate  DATE, ToDate DATE, SomeId INT, BatchNumber INT);

INSERT INTO T1
SELECT TOP 1000 FromDate = '2017-01-01',
                ToDate = '2017-01-01',
                SomeId = ROW_NUMBER() OVER (ORDER BY @@SPID) -1,
                BatchNumber = 1
FROM   master..spt_values v1

CREATE TABLE T2 (SomeDateTime DATETIME, SomeId INT, INDEX IX(SomeDateTime));

INSERT INTO T2
SELECT TOP 1000000 '2017-01-01',
                   ROW_NUMBER() OVER (ORDER BY @@SPID) %1000
FROM   master..spt_values v1,
       master..spt_values v2

T1 contains 1,000 rows.

The FromDate, ToDate, and BatchNumber values are identical in all of them. The only column that differs is SomeId, with values between 0 and 999:

+------------+------------+--------+-------------+
|  FromDate  |   ToDate   | SomeId | BatchNumber |
+------------+------------+--------+-------------+
| 2017-01-01 | 2017-01-01 |      0 |           1 |
| 2017-01-01 | 2017-01-01 |      1 |           1 |
...
| 2017-01-01 | 2017-01-01 |    998 |           1 |
| 2017-01-01 | 2017-01-01 |    999 |           1 |
+------------+------------+--------+-------------+

T2 contains 1 million rows, but only 1,000 distinct ones, each repeated 1,000 times as below:

+-------------------------+--------+-------+
|      SomeDateTime       | SomeId | Count |
+-------------------------+--------+-------+
| 2017-01-01 00:00:00.000 |      0 |  1000 |
| 2017-01-01 00:00:00.000 |      1 |  1000 |
...
| 2017-01-01 00:00:00.000 |    998 |  1000 |
| 2017-01-01 00:00:00.000 |    999 |  1000 |
+-------------------------+--------+-------+
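For reference, a couple of quick aggregates along these lines can confirm the distributions described above (a sanity-check sketch, not part of the original repro):

-- T1: 1,000 rows, one per SomeId value, all with BatchNumber = 1
SELECT BatchNumber, COUNT(*) AS RowCnt, COUNT(DISTINCT SomeId) AS DistinctSomeIds
FROM   T1
GROUP  BY BatchNumber;

-- T2: 1,000 distinct (SomeDateTime, SomeId) pairs, each repeated 1,000 times
SELECT SomeDateTime, SomeId, COUNT(*) AS [Count]
FROM   T2
GROUP  BY SomeDateTime, SomeId
ORDER  BY SomeId;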

Executing the following

SELECT *
FROM   T1
       INNER JOIN T2
               ON CAST(t2.SomeDateTime AS DATE) BETWEEN T1.FromDate AND T1.ToDate
                  AND T1.SomeId = T2.SomeId
WHERE  T1.BatchNumber = 1

Takes about 7 seconds on my machine. The actual and estimated rows are perfect for all operators in the plan.

(screenshot: execution plan with accurate estimates)

Now add 3,000 additional batches to T1 (with batch numbers 2 to 3,001). These each clone the existing thousand rows for batch number 1:

INSERT INTO T1
SELECT T1.FromDate,
       T1.ToDate,
       T1.SomeId,
       Nums.NewBatchNumber
FROM   T1
       CROSS JOIN (SELECT TOP (3000) 1 + ROW_NUMBER() OVER (ORDER BY @@SPID) AS NewBatchNumber
                   FROM   master..spt_values v1, master..spt_values v2) Nums 

and update the statistics for luck

 UPDATE STATISTICS T1 WITH FULLSCAN

And run the original query again.

SELECT *
FROM   T1
       INNER JOIN T2
               ON CAST(t2.SomeDateTime AS DATE) BETWEEN T1.FromDate AND T1.ToDate
                  AND T1.SomeId = T2.SomeId
WHERE  T1.BatchNumber = 1

I let it run for a minute before killing it. By that time it had output 40,380 rows, so I guess it would take around 25 minutes to output the full million.

The only thing that changed is that I added some additional rows not matching the T1.BatchNumber = 1 predicate.

However, the plan has now changed. It now uses nested loops, and whilst the number of rows coming out of T1 is still correctly estimated at 1,000 (①), the estimate of the number of joined rows has dropped from 1 million to a thousand (②).

(screenshot: new execution plan, with ① and ② marking the estimates described above)

So the question is…

Why does adding additional rows with BatchNumber <> 1 somehow affect the estimates for rows joined when BatchNumber = 1?

It certainly seems counterintuitive that adding rows to a table should end up reducing the estimated number of rows from the query as a whole.

Best Answer

It's important to remember that you aren't guaranteed consistency as you change queries or the data within the tables. The query optimizer may switch to using a different method of cardinality estimation (such as using density as opposed to histograms), which can make two queries appear to be inconsistent with each other. With that said, it certainly seems like the query optimizer is making an unreasonable choice in your case, so let's dig in.

Your demo is too complicated so I'm going to work off of a simpler example which I believe shows the same behavior. Starting data prep and table definitions:

DROP TABLE IF EXISTS dbo.T1;
CREATE TABLE dbo.T1 (FromDate DATE, ToDate DATE, SomeId INT);

INSERT INTO dbo.T1 WITH (TABLOCK)
SELECT TOP 1000 NULL, NULL, 1
FROM master..spt_values v1;

DROP TABLE IF EXISTS dbo.T2;
CREATE TABLE dbo.T2 (SomeDateTime DATETIME, INDEX IX(SomeDateTime));

INSERT INTO dbo.T2 WITH (TABLOCK)
SELECT TOP 2 NULL
FROM master..spt_values v1
CROSS JOIN master..spt_values v2;

Here's the SELECT query to investigate:

SELECT *
FROM T1
INNER JOIN T2 ON t2.SomeDateTime BETWEEN T1.FromDate AND T1.ToDate
WHERE T1.SomeId = 1;

This query is simple enough so that we can work out the formula for the cardinality estimate without any trace flags. However, I'm going to try to use TF 2363 as I go to better illustrate what's going on within the optimizer. It's not clear if I'll be successful.

Define the following variables:

C1 = number of rows in table T1

C2 = number of rows in table T2

S1 = the selectivity of the T1.SomeId filter

My claim is that the cardinality estimate for the above query is as follows (a small arithmetic sketch of the claim follows the list):

  1. When C2 >= S1 * C1:

C2 * S1 with a lower bound of S1 * C1

  2. When C2 < S1 * C1:

164.317 * C2 * S1 with an upper bound of S1 * C1
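As a sanity check, here is a small T-SQL sketch (my own addition, not from the original answer) that simply evaluates the claimed formula for given values of C1, C2, and S1:

-- Evaluate the claimed cardinality estimate formula for chosen inputs.
DECLARE @C1 FLOAT = 1000,  -- rows in T1
        @C2 FLOAT = 2,     -- rows in T2
        @S1 FLOAT = 1.0;   -- selectivity of the T1.SomeId filter

SELECT CASE
           WHEN @C2 >= @S1 * @C1
               THEN CASE WHEN @C2 * @S1 > @S1 * @C1
                         THEN @C2 * @S1
                         ELSE @S1 * @C1 END                -- lower bound of S1 * C1
           ELSE CASE WHEN 164.317 * @C2 * @S1 < @S1 * @C1
                     THEN 164.317 * @C2 * @S1
                     ELSE @S1 * @C1 END                    -- upper bound of S1 * C1
       END AS ClaimedEstimate;                             -- 328.634 for these values

With these inputs it returns 328.634; changing @C2 to 7 returns 1000 (the upper bound kicks in), and @C2 = 50000 returns 50000, matching the examples worked through below.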

Let's go through some examples, although I'm not going to go through every single one that I tested. For the initial data prep we have:

C1 = 1000

C2 = 2

S1 = 1.0

Therefore, the cardinality estimate should be:

2 * 164.317 = 328.634

The impossible-to-fake screenshot below proves this:

(screenshot: example 1)

Using the undocumented trace flag 2363 we can get a few clues about what's going on:

Plan for computation:

  CSelCalcColumnInInterval

      Column: QCOL: [SE_DB2].[dbo].[T1].SomeId

Loaded histogram for column QCOL: [SE_DB2].[dbo].[T1].SomeId from stats with id 2

Selectivity: 1

Stats collection generated: 

  CStCollFilter(ID=3, CARD=1000)

      CStCollBaseTable(ID=1, CARD=1000 TBL: T1)

End selectivity computation

Begin selectivity computation

Input tree:

...

Plan for computation:

  CSelCalcSimpleJoinWithUpperBound (Using base cardinality)

      CSelCalcOneSided (RIGHT)

          CSelCalcCombineFilters_ExponentialBackoff (AND)

              CSelCalcFixedFilter (0.3)

              CSelCalcFixedFilter (0.3)

Selectivity: 0.164317

Stats collection generated: 

  CStCollJoin(ID=4, CARD=328.634 x_jtInner)

      CStCollFilter(ID=3, CARD=1000)

          CStCollBaseTable(ID=1, CARD=1000 TBL: T1)

      CStCollBaseTable(ID=2, CARD=2 TBL: T2)

End selectivity computation

With the new CE we get the usual 16% estimate for a BETWEEN. This is due to exponential backoff with the new 2014 CE. Each inequality has a selectivity estimate of 0.3, so the BETWEEN selectivity is calculated as 0.3 * SQRT(0.3) = 0.164317. Multiply that selectivity by the number of rows in T2 and T1 and we get our estimate. Seems reasonable enough.

Let's bump up the number of rows in T2 to 7. Now we have the following:

C1 = 1000

C2 = 7

S1 = 1.0

Therefore, the cardinality estimate should be 1000 because:

7 * 164.317 = 1150 > 1000

Query plan confirms it:

(screenshot: example 1)

We can take another peek with TF 2363 but it looks like the selectivity was adjusted behind the scenes to respect the upper bound. I suspect that CSelCalcSimpleJoinWithUpperBound prevents the cardinality estimate from going above 1000.

Loaded histogram for column QCOL: [SE_DB2].[dbo].[T1].SomeId from stats with id 2

Selectivity: 1

Stats collection generated: 

  CStCollFilter(ID=3, CARD=1000)

      CStCollBaseTable(ID=1, CARD=1000 TBL: T1)

End selectivity computation

Begin selectivity computation

Input tree:

...

Plan for computation:

  CSelCalcSimpleJoinWithUpperBound (Using base cardinality)

      CSelCalcOneSided (RIGHT)

          CSelCalcCombineFilters_ExponentialBackoff (AND)

              CSelCalcFixedFilter (0.3)

              CSelCalcFixedFilter (0.3)

Selectivity: 0.142857

Stats collection generated: 

  CStCollJoin(ID=4, CARD=1000 x_jtInner)

      CStCollFilter(ID=3, CARD=1000)

          CStCollBaseTable(ID=1, CARD=1000 TBL: T1)

      CStCollBaseTable(ID=2, CARD=7 TBL: T2)
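As a side note (my own arithmetic, not from the answer): the reported selectivity of 0.142857 is exactly 1/7, which is just what is needed for the join cardinality calculation of (filtered T1 rows) * (T2 rows) * selectivity to land exactly on the 1,000-row cap:

-- 0.142857... = 1/7, so the join cardinality formula hits the upper bound exactly.
SELECT 1.0 / 7              AS AdjustedSelectivity,  -- 0.142857...
       1000 * 7 * (1.0 / 7) AS CappedJoinCard;       -- 1000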

Let's bump T2 to 50000 rows. Now we have:

C1 = 1000

C2 = 50000

S1 = 1.0

Therefore, the cardinality estimate should be:

50000 * 1.0 = 50000

The query plan again confirms it. It's much easier to guess at the estimate after you've already figured out the formula:

(screenshot: example 2)

TF output:

Loaded histogram for column QCOL: [SE_DB2].[dbo].[T1].SomeId from stats with id 2

Selectivity: 1

Stats collection generated: 

  CStCollFilter(ID=3, CARD=1000)

      CStCollBaseTable(ID=1, CARD=1000 TBL: T1)

...

Plan for computation:

  CSelCalcSimpleJoinWithUpperBound (Using base cardinality)

      CSelCalcOneSided (RIGHT)

          CSelCalcCombineFilters_ExponentialBackoff (AND)

              CSelCalcFixedFilter (0.3)

              CSelCalcFixedFilter (0.3)

Selectivity: 0.001

Stats collection generated: 

  CStCollJoin(ID=4, CARD=50000 x_jtInner)

      CStCollFilter(ID=3, CARD=1000)

          CStCollBaseTable(ID=1, CARD=1000 TBL: T1)

      CStCollBaseTable(ID=2, CARD=50000 TBL: T2)

For this example the exponential backoff appears to be irrelevant:

50000 * 1000 * 0.001 = 50000.

Now let's add 3k rows to T1 with a SomeId value of 0. Code to do so:

INSERT INTO T1 WITH (TABLOCK)
SELECT TOP 3000 NULL, NULL, 0
FROM   master..spt_values v1,
       master..spt_values v2;

UPDATE STATISTICS T1 WITH FULLSCAN;
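To double-check the selectivity of the T1.SomeId = 1 filter after this insert (a quick sanity-check query of my own, not part of the original answer):

-- 1,000 of the 4,000 rows still have SomeId = 1, so S1 = 0.25.
SELECT COUNT(*)                                      AS TotalRows,     -- 4000
       SUM(CASE WHEN SomeId = 1 THEN 1 ELSE 0 END)   AS MatchingRows,  -- 1000
       AVG(CASE WHEN SomeId = 1 THEN 1.0 ELSE 0 END) AS S1             -- 0.25
FROM   dbo.T1;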

Now we have:

C1 = 4000

C2 = 50000

S1 = 0.25

Therefore, the cardinality estimate should be:

50000 * 0.25 = 12500

Query plan confirms it:

(screenshot: example 3)

This is the same behavior that you called out in the question. I added irrelevant rows to a table and the cardinality estimate decreased. Why did that happen? Pay attention to the Selectivity and CARD lines in the TF output below:

Loaded histogram for column QCOL: [SE_DB2].[dbo].[T1].SomeId from stats with id 2

Selectivity: 0.25

Stats collection generated: 

  CStCollFilter(ID=3, CARD=1000)

      CStCollBaseTable(ID=1, CARD=4000 TBL: T1)

End selectivity computation

Begin selectivity computation

Input tree:

...

Plan for computation:

  CSelCalcSimpleJoinWithUpperBound (Using base cardinality)

      CSelCalcOneSided (RIGHT)

          CSelCalcCombineFilters_ExponentialBackoff (AND)

              CSelCalcFixedFilter (0.3)

              CSelCalcFixedFilter (0.3)

Selectivity: 0.00025

Stats collection generated: 

  CStCollJoin(ID=4, CARD=12500 x_jtInner)

      CStCollFilter(ID=3, CARD=1000)

          CStCollBaseTable(ID=1, CARD=4000 TBL: T1)

      CStCollBaseTable(ID=2, CARD=50000 TBL: T2)

End selectivity computation

It seems as if the cardinality estimate for this case was calculated as:

C1 * S1 * C2 * S1 / (S1 * C1)

Or for this particular example:

4000 * 0.25 * 50000 * 0.25 / (0.25 * 4000) = 12500

The general formula can of course be simplified to:

C2 * S1

Which is the formula that I claimed above. It seems like there's some cancellation going on that shouldn't be. I would expect the total number of rows in T1 to be relevant to the estimate.
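Just to make the cancellation concrete, both forms evaluate to the same number for this example (my own quick check):

DECLARE @C1 FLOAT = 4000, @C2 FLOAT = 50000, @S1 FLOAT = 0.25;

SELECT @C1 * @S1 * @C2 * @S1 / (@S1 * @C1) AS LongForm,    -- 12500
       @C2 * @S1                           AS Simplified;  -- 12500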

If we insert more rows into T1 we can see the lower bound in action:

INSERT INTO T1 WITH (TABLOCK)
SELECT TOP 997000 NULL, NULL, 0
FROM   master..spt_values v1,
       master..spt_values v2;

UPDATE STATISTICS T1 WITH FULLSCAN;

The cardinality estimate in this case is 1000 rows. I will omit the query plan and the TF 2363 output.
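For completeness, here is what the claimed formula predicts for this final state (again my own arithmetic, assuming S1 is simply the fraction of T1 rows with SomeId = 1):

-- T1 now has 1000 + 3000 + 997000 = 1,001,000 rows, 1,000 of which have SomeId = 1.
DECLARE @C1 FLOAT = 1001000,
        @C2 FLOAT = 50000,
        @S1 FLOAT = 1000.0 / 1001000;  -- ~0.000999

SELECT @C2 * @S1 AS UnboundedEstimate, -- ~49.95
       @S1 * @C1 AS LowerBound;        -- 1000, which is the estimate that wins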

In closing, this behavior is pretty suspicious, but I don't know enough to say whether it's a bug or not. My example doesn't match your repro exactly, but I believe I observed the same general behavior. I would also say that you got a bit lucky with how you chose your initial data. There seems to be a fair amount of guessing going on by the optimizer, so I wouldn't get too hung up on the fact that the original query returned 1 million rows, which matched the estimate exactly.