SQL Server – Statistics and row estimation

cardinality-estimates, sql-server, sql-server-2016

I hope someone can help me learn how SQL Server uses statistics to estimate the number of records.

Test Scripts

USE [tempdb]

GO

CREATE TABLE t1
(
a INT NOT NULL, 
b INT NOT NULL, 
c INT CHECK (c between 1 and 50), 
CONSTRAINT pk_a primary key(a)
);

GO

INSERT INTO t1(a,b,c) 
SELECT number, number%1000+1, number%50+1 
FROM master..spt_values 
WHERE type = 'P' AND number BETWEEN 1 and 1000;

GO

CREATE STATISTICS s_b ON t1(b);
CREATE STATISTICS s_c ON t1(c);

GO

Sample Query

DECLARE @c INT=300;
SELECT * FROM t1 WHERE b>@c;  -- Query 1
SELECT * FROM t1 WHERE b>300; -- Query 2

Query 1

Estimated number of rows: 300

Query 2

Estimated number of rows: 700

Questions

  1. Why is there such a huge difference between the row estimates for the 1st and 2nd queries?
  2. For the 2nd query, how does SQL Server arrive at an estimate of 700 from the statistics?
  3. Are there any good articles for learning about this?

Best Answer

The queries give different estimates because the first one does not benefit from the Parameter Embedding Optimization. From the point of view of the optimizer the query might as well be:

SELECT * FROM t1 WHERE b>???; 

To get an idea of how cardinality estimates might work for simple queries, sometimes I think about what I would do if I had to program the query optimizer to produce an estimate. Often those thoughts don't exactly match what SQL Server does but I find it helpful to keep in mind that the optimizer is almost always working with imperfect or limited information. Let's go back to your first query:

DECLARE @c INT=300; 
SELECT * FROM t1 WHERE b>@c; 

Without a RECOMPILE hint the optimizer doesn't know the value of the @c local variable. There isn't enough information here to do a proper estimate, but the optimizer needs to create an estimate for all queries. So what can a programmer of the optimizer do? I suppose that a sophisticated algorithm that makes a guess based on the distribution of values in the column would be possible to implement, but Microsoft uses a hardcoded guess of 30% for an unknown inequality. You can observe this by changing the value of the local variable:

DECLARE @c INT=999999999; 
SELECT * FROM t1 WHERE b>@c; 

The estimate is still 300 rows. Note that the estimates do not need to be consistent between queries. The following query also has an estimate of 300 rows as opposed to 1000 - 300 = 700:

DECLARE @c INT=300; 
SELECT * FROM t1 WHERE NOT (b>@c); 

Conceptually, it can be rewritten as

DECLARE @c INT=300; 
SELECT * FROM t1 WHERE b<=@c; 

That rewritten form gets the same hardcoded 30% estimate.
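In all three cases the guess is just 30% of the table's 1,000 rows: 0.30 × 1,000 = 300. As a quick sanity check (a sketch, not part of the original demo), you can add OPTION (RECOMPILE) so that the parameter embedding optimization kicks in and the runtime value of @c is treated like a literal:

DECLARE @c INT=300;
-- With a statement-level recompile the value of @c is embedded at compile time,
-- so the optimizer can use the histogram and the estimate should be about 700 rows
-- rather than the fixed 30% guess.
SELECT * FROM t1 WHERE b>@c OPTION (RECOMPILE);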

Your second query does not use a local variable, so the optimizer can directly use the statistics object that you created on the column to aid with estimation. Histograms can have at most 201 steps, and your table has 1000 distinct values, so there are some filter values for which the optimizer does not have complete information. In those cases it needs to make more of a guess. Here's how you can view the histogram for the b column:

DBCC SHOW_STATISTICS ('t1', 's_b') WITH HISTOGRAM;

On my machine I got lucky in that 300 is the high value for one of the steps of the histogram:

[screenshot: DBCC SHOW_STATISTICS histogram output for s_b]

You can find an explanation of the different columns in the documentation. Going forward I'm going to assume that you know what they all mean. The query requests rows with b > 300, so the histogram step with a RANGE_HI_KEY of 300 is irrelevant, as are all of the steps with a lower RANGE_HI_KEY. For this query, I would program the optimizer to simply sum all of the RANGE_ROWS and EQ_ROWS values for the remaining 133 steps of the histogram. Those columns add up to 700 and that is SQL Server's estimate.
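If you would rather not add the qualifying steps up by hand, here is one way to do the same sum in T-SQL. This is a sketch that assumes sys.dm_db_stats_histogram is available (SQL Server 2016 SP1 CU2 and later); on older builds you could instead insert the DBCC SHOW_STATISTICS output into a temporary table.

-- Sum RANGE_ROWS + EQ_ROWS for every histogram step strictly above the filter value.
-- The step with RANGE_HI_KEY = 300 and everything below it contribute nothing to b > 300.
SELECT SUM(h.range_rows + h.equal_rows) AS estimated_rows
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_histogram(s.[object_id], s.stats_id) AS h
WHERE s.[object_id] = OBJECT_ID('dbo.t1')
  AND s.[name] = 's_b'
  AND CONVERT(INT, h.range_high_key) > 300;

On this data the sum should come back as 700, matching the estimate in the plan.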

Other filter values may not give exact results. For example, the following two queries both have cardinality estimates of 704 rows:

SELECT * FROM t1 WHERE b>293; -- returns 707 rows
SELECT * FROM t1 WHERE b>299; -- returns 701 rows

Both estimates are very close but not exactly right. The histogram does not contain enough information for those values to provide an exact estimate.
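If you want to compare the estimates and the actual row counts yourself, one option (a sketch; capturing the actual execution plan any other way works just as well) is to return the plans with SET STATISTICS XML and read the EstimateRows attribute alongside the actual counts:

SET STATISTICS XML ON;
SELECT * FROM t1 WHERE b>293; -- estimated 704, returns 707 rows
SELECT * FROM t1 WHERE b>299; -- estimated 704, returns 701 rows
SET STATISTICS XML OFF;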