Sql-server – Direct values on predicate produce not-so-good plan

cardinality-estimates, optimization, performance, query-performance, sql server, sql-server-2012

I'm using StackOverflow dump to run some tests.

In particular, I'm querying this table:

I have created this index:

I'm running the following query (forcing an index just to test alternatives):

I get the following execution plan, which has a high cost (66.63).

These are the IO statistics after running this query:

Then I run the same query, providing variables instead of direct values:

I get a better plan (Cost is 0.4385).

Statistics are also better:

At first I thought SQL Server was not recognizing the direct values as INT, but there are no type mismatch or implicit conversion warnings.

I also tried avoiding parallelism, but I still get a high-cost plan (and higher IO statistics) with MAXDOP 1 when passing direct values in the predicate.

When comparing both plans, there are different estimates:

What's wrong with direct values being passed as part of predicate?

Best Answer

What's wrong with direct values being passed as part of predicate?

The Difference Between Variables and Literal Values

The optimizer uses the statistics density vector when calculating estimates for variables.

When "direct" or "static" (literal) values are embedded in the query text, the statistics histogram is used instead. This is why you get different estimates, and thus different plans.
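To make the distinction concrete, here is a minimal sketch of the two query forms. The original query is not shown above, so the SELECT list and the PostTypeId value of 1 are assumptions for illustration; the OwnerUserId value of 22656 and the index hint mirror what the question and the histogram output describe.

```sql
-- Literal values: the optimizer can look up 22656 in the histogram.
SELECT Id
FROM dbo.Posts WITH (INDEX (IX_Posts_OwnerUserId))
WHERE OwnerUserId = 22656
  AND PostTypeId = 1;          -- assumed value, not from the original question

-- Local variables: values are unknown at compile time,
-- so the estimate falls back to the density vector.
DECLARE @OwnerUserId int = 22656,
        @PostTypeId  int = 1;  -- assumed value, not from the original question

SELECT Id
FROM dbo.Posts WITH (INDEX (IX_Posts_OwnerUserId))
WHERE OwnerUserId = @OwnerUserId
  AND PostTypeId = @PostTypeId;

-- Same variables with OPTION (RECOMPILE): the runtime values are visible
-- to the optimizer, so the estimate should match the literal version.
SELECT Id
FROM dbo.Posts WITH (INDEX (IX_Posts_OwnerUserId))
WHERE OwnerUserId = @OwnerUserId
  AND PostTypeId = @PostTypeId
OPTION (RECOMPILE);
```

Comparing the estimated row counts on the index seek in each plan should show the density-based estimate for the middle query and the histogram-based estimate for the other two.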

Here's my estimated plan: https://www.brentozar.com/pastetheplan/?id=SJCduTuKN

On my 2010 copy of the SO database, the density of the OwnerUserId column is 0.000003807058. Multiplying that by 3,744,192 rows gives 14.2544 rows, which is exactly the number of rows estimated to come out of IX_Posts_OwnerUserId.

You can get this information about the statistics for that index by running this DBCC command:

DBCC SHOW_STATISTICS('dbo.Posts', 'IX_Posts_OwnerUserId');

Here's the (abbreviated) output:

Name                    Updated             Rows
IX_Posts_OwnerUserId    Apr  8 2019  8:33AM 3744192

All density     Average Length  Columns
3.807058E-06    4               OwnerUserId

Since PostTypeId is also part of the WHERE clause, statistics are automatically generated for that column as well. The estimate from that density vector comes out to 0.25 x 3,744,192 rows = 936,048 rows.

DBCC SHOW_STATISTICS('dbo.Posts', '_WA_Sys_00000010_0519C6AF');

And the output:

Name                        Updated             Rows
_WA_Sys_00000010_0519C6AF   Apr  8 2019  9:04AM 3744192

All density Average Length  Columns
0.25        4               PostTypeId
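The auto-created statistics name (_WA_Sys_00000010_0519C6AF) is specific to my copy of the database and will likely differ on yours. As a rough sketch, something like the following catalog query should list the statistics objects that cover PostTypeId so you can plug the right name into DBCC SHOW_STATISTICS:

```sql
-- List statistics on dbo.Posts that include the PostTypeId column.
-- auto_created = 1 marks the _WA_Sys_* statistics built by the optimizer.
SELECT s.name         AS StatName,
       c.name         AS ColumnName,
       s.auto_created AS IsAutoCreated
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON sc.object_id = s.object_id
   AND sc.stats_id  = s.stats_id
JOIN sys.columns AS c
    ON c.object_id = sc.object_id
   AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID('dbo.Posts')
  AND c.name = 'PostTypeId';
```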

Since this is an "AND" predicate, the estimate uses the lower of the two.
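The arithmetic behind both density-based estimates can be checked directly. This is only a worked calculation using the "All density" and "Rows" values copied from the DBCC output above, not something the optimizer exposes in this form:

```sql
-- Density-vector estimate = "All density" x table row count.
DECLARE @TableRows float = 3744192;

SELECT 3.807058E-06 * @TableRows AS OwnerUserIdEstimate,  -- roughly 14.25 rows
       0.25         * @TableRows AS PostTypeIdEstimate;   -- 936,048 rows
```

The OwnerUserId figure is the smaller of the two, and it matches the 14.2544 rows estimated in the plan that uses variables.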

When you use static values instead of variables, it uses the statistics histogram. This is the third result set of that SHOW_STATISTICS command. For the key you're using, here is the histogram entry:

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
22656           13040       11371   305                 42.7541

This is where the estimate of 11,371 comes from in the "static values" plan.
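If you only want one of the three result sets, DBCC SHOW_STATISTICS accepts a WITH option to return just that part. A short sketch, reusing the index name from above:

```sql
-- Header only (name, last update, row count)
DBCC SHOW_STATISTICS ('dbo.Posts', 'IX_Posts_OwnerUserId') WITH STAT_HEADER;

-- Density vector only (used for variables)
DBCC SHOW_STATISTICS ('dbo.Posts', 'IX_Posts_OwnerUserId') WITH DENSITY_VECTOR;

-- Histogram only (used for literal values)
DBCC SHOW_STATISTICS ('dbo.Posts', 'IX_Posts_OwnerUserId') WITH HISTOGRAM;
```

In the histogram output, look for the step whose RANGE_HI_KEY equals the value in your predicate; its EQ_ROWS is the estimate for an equality match on that value.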

The histogram often gives a better estimate because it captures skew: a big table like this usually has outliers (here, one OwnerUserId with 11,371 rows) that an average density hides.

Costing Differences

In this specific case, the histogram produces an estimate that is exactly right. The cost of the plan produced is (correctly) higher than the one that uses the density vector, because it has to process many more rows.

The "lower cost" plan thinks that 14 rows will be produced by that seek, when in fact 11,371 rows are produced.

Logical Reads

Logical reads are slightly higher in the parallel plan because of nested loops prefetching. It doesn't seem to make a big difference on my machine - elapsed time for the queries was within 10 ms of each other.

The parallelism doesn't actually help anything, since all of the rows end up on one thread (on my machine anyway). Adding OPTION (MAXDOP 1) helps with the execution time, but doesn't remove the extra logical reads.
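To reproduce the read and timing comparison yourself, a sketch along these lines should work. As before, the SELECT list and the PostTypeId value of 1 are assumptions, since the original query is not shown; run it once with and once without the MAXDOP hint and compare the Messages output.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Serial version of the literal-value query.
SELECT Id
FROM dbo.Posts WITH (INDEX (IX_Posts_OwnerUserId))
WHERE OwnerUserId = 22656
  AND PostTypeId = 1           -- assumed value, not from the original question
OPTION (MAXDOP 1);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```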

One potential solution to the "extra reads" problem for this query would be to avoid the key lookup entirely by adding PostTypeId as an included column:

CREATE INDEX IX_Posts_OwnerUserId ON dbo.Posts (OwnerUserId) 
INCLUDE (PostTypeId) 
WITH (DROP_EXISTING = ON);

Related Solutions

Sql-server – get SSMS to show me the Actual query costs in the Execution plan pane

I don't know of a way to do this in the plan from Management Studio, but this is one of the many things the free SentryOne Plan Explorer will do for you when you generate an actual plan from within the tool - it includes all the runtime metrics per statement.

Sql-server – Predicate not resulting in expected scan

Your query is looking for values between 2015-04-25 AND 2015-05-30.

Of which it looks like you have about 30 million.

Unfortunately, there is a bug in cardinality estimation when DATEDIFF is involved: its arguments get reversed.

SELECT DATEADD(WK, DATEDIFF(WK, GETDATE(), 0) - 4, -30) ,  
       DATEADD(WK, DATEDIFF(WK, GETDATE(), 0) - 4, 0) + 5

That returns a range of 1784-05-15 to 1784-06-19. It is extremely likely that you have no dates in that range in the table (or statistics histogram), so SQL Server will estimate that no rows will be returned from the scan, and certainly not 30 million lookups.

The bug has been fixed, but the fix requires trace flag 4199 to be enabled.

If that isn't possible, you could try rephrasing the predicate in a way that avoids DATEDIFF, or assigning the values to variables and using OPTION (RECOMPILE).