Hope someone can help me learn how SQL Server uses statistics to estimate the number of rows.
Test Scripts
USE [tempdb]
GO
CREATE TABLE t1
(
a INT NOT NULL,
b INT NOT NULL,
c INT CHECK (c between 1 and 50),
CONSTRAINT pk_a primary key(a)
);
GO
INSERT INTO t1(a,b,c)
SELECT number, number%1000+1, number%50+1
FROM master..spt_values
WHERE type = 'P' AND number BETWEEN 1 and 1000;
GO
CREATE STATISTICS s_b ON t1(b);
CREATE STATISTICS s_c ON t1(c);
GO
Sample Query
DECLARE @c INT = 300;
SELECT * FROM t1 WHERE b > @c;
SELECT * FROM t1 WHERE b > 300;
Query 1: Estimated Number of Rows = 300
Query 2: Estimated Number of Rows = 700
Questions
- Why is there such a huge difference between the 1st and 2nd query's row estimates?
- For the 2nd query, how does SQL Server arrive at an estimate of 700 from the stats?
- Any good articles to learn this from?
Best Answer
The queries give different estimates because the first one does not benefit from the Parameter Embedding Optimization: the value of @c is unknown when the plan is compiled, so from the point of view of the optimizer the query might as well be filtering on any value at all.
To get an idea of how cardinality estimates might work for simple queries, sometimes I think about what I would do if I had to program the query optimizer to produce an estimate. Often those thoughts don't exactly match what SQL Server does but I find it helpful to keep in mind that the optimizer is almost always working with imperfect or limited information. Let's go back to your first query:
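Repeating the first query from the sample above:

```sql
DECLARE @c INT = 300;
SELECT * FROM t1 WHERE b > @c;
```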
Without a RECOMPILE hint, the optimizer doesn't know the value of the @c local variable. There isn't enough information here to do a proper estimate, but the optimizer needs to create an estimate for all queries. So what can a programmer of the optimizer do? I suppose that a sophisticated algorithm that makes a guess based on the distribution of values in the column would be possible to implement, but Microsoft uses a hardcoded guess of 30% of the table's rows for an unknown inequality. You can observe this by changing the value of the local variable: the estimate is still 300 rows (30% of 1000). Note that the estimates do not need to be consistent between queries. The complementary query, WHERE b <= @c, also has an estimate of 300 rows as opposed to 1000 - 300 = 700; conceptually it is just another inequality against an unknown value, so it gets the same hardcoded 30% estimate.
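A sketch of that experiment (the specific variable values here are illustrative; any values behave the same way):

```sql
-- The variable's value doesn't matter: both inequalities get the
-- hardcoded 30% guess, i.e. an estimate of 300 rows out of 1000.
DECLARE @c INT = 1;
SELECT * FROM t1 WHERE b > @c;   -- estimated rows: 300

DECLARE @c2 INT = 300;
SELECT * FROM t1 WHERE b <= @c2; -- estimated rows: 300, not 700
```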
Your second query does not use a local variable, so the optimizer can directly use the statistics object that you created on the column to aid with estimation. Histograms can have at most 201 steps, and your table has 1000 distinct values in b, so there are some filter values for which the optimizer does not have complete information. In those cases it needs to make more of a guess. You can view the histogram for the b column with DBCC SHOW_STATISTICS. On my machine, I got lucky in that 300 happens to be the high value for one of the steps of the histogram.
You can find an explanation of the different columns in the documentation for DBCC SHOW_STATISTICS; going forward I'm going to assume that you know what they all mean. The query requests rows with b > 300, so the histogram step with a RANGE_HI_KEY of 300 is irrelevant, as are all of the steps with a lower RANGE_HI_KEY. For this query, I would program the optimizer to simply sum the RANGE_ROWS and EQ_ROWS values for the remaining 133 steps of the histogram. Those columns add up to 700, and that is SQL Server's estimate.

Other filter values may not give exact results. For example, two other filter values that I tested both produced cardinality estimates of 704 rows.
Both estimates are very close but not exactly right. The histogram does not contain enough information for those values to provide an exact estimate.
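To double-check the exact case, you can capture the histogram into a temp table and redo the summation yourself. This is only a sketch: it assumes the t1 table and s_b statistics object from the setup script, and that the histogram columns come back in the documented DBCC SHOW_STATISTICS order.

```sql
-- Capture the histogram output of DBCC SHOW_STATISTICS.
CREATE TABLE #hist
(
    RANGE_HI_KEY        INT,    -- leading column of s_b is INT
    RANGE_ROWS          FLOAT,
    EQ_ROWS             FLOAT,
    DISTINCT_RANGE_ROWS FLOAT,
    AVG_RANGE_ROWS      FLOAT
);

INSERT INTO #hist
EXEC ('DBCC SHOW_STATISTICS (''t1'', ''s_b'') WITH HISTOGRAM');

-- Sum RANGE_ROWS + EQ_ROWS over every step strictly above the
-- filter value; for b > 300 this reproduces the 700-row estimate.
SELECT SUM(RANGE_ROWS + EQ_ROWS) AS estimated_rows
FROM #hist
WHERE RANGE_HI_KEY > 300;
```

The WHERE clause deliberately excludes the step whose RANGE_HI_KEY is exactly 300, matching the reasoning above.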