Sql-server – Direct values on predicate produce not-so-good plan

cardinality-estimatesoptimizationperformancequery-performancesql serversql-server-2012

I'm using StackOverflow dump to run some tests.

In particular, I'm querying this table:

Posts Table

I have created this index:

Created index

I'm running the following query (just forcing an index to test alternatives)

Query with direct values

I get the following exec plan high a cost (66.63).

Not-so-good plan

These are the IO statistics after running this query:

enter image description here

Then, I run the same query by providing variables instead of direct values

Query with variables

I get a better plan (Cost is 0.4385).

Better plan

Statistics are also better:

Better statistics

At first…I thought SQL Server was not recognizing direct values as INT, but there are neither type mismatch nor implicit conversion warnings.

I also tried avoiding parallelism, but I still get a high-cost plan (and higher IO statistics) with MAXDOP 1 when passing direct values in the predicate.

When comparing both plans, there are different estimates:

enter image description here

What's wrong with direct values being passed as part of predicate?

Best Answer

What's wrong with direct values being passed as part of predicate?

The Difference Between Variables and Parameters

The optimizer uses the statistics density vector when calculating estimates for variables.

When "direct" or "static" values are embedded in the query directly, the statistics histogram is used instead. This is why you get different estimates, and thus different plans.

Here's my estimated plan: https://www.brentozar.com/pastetheplan/?id=SJCduTuKN

On my 2010 copy of the SO database, the density of the OwnerUserId column is .000003807058. Multiplying that by 3,744,192 rows = 14.2544 rows. Which is exactly the number of rows estimated to come out of IX_Posts_OwnerUserId.

screenshot of the index seek estimate

You can get this information about the statistics for that index by running this DBCC command:

DBCC SHOW_STATISTICS('dbo.Posts', 'IX_Posts_OwnerUserId');

Here's the (abbreviated) output:

Name                    Updated             Rows
IX_Posts_OwnerUserId    Apr  8 2019  8:33AM 3744192

All density     Average Length  Columns
3.807058E-06    4               OwnerUserId

Since PostTypeId is also part of the WHERE clause, statistics are automatically generated for that column as well. That density vector comes out to be .25 x 3,744,192 rows = 936,048 rows.

DBCC SHOW_STATISTICS('dbo.Posts', '_WA_Sys_00000010_0519C6AF');

And the output:

Name                        Updated             Rows
_WA_Sys_00000010_0519C6AF   Apr  8 2019  9:04AM 3744192

All density Average Length  Columns
0.25        4               PostTypeId

Since this is an "AND" predicate, the estimate uses the lower of the two.

When you use static values instead of variables, it uses the statistics histogram. This is in the third resultset of that SHOW_STATISTICS command. For the key you're using, here is the histogram entry:

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
22656           13040       11371   305                 42.7541

This is where the estimate of 11,371 comes from in the "static values" plan.

The histogram can be a better estimate a lot of the time, as it deals with edge cases a bit better - since there will often be some outliers in a big table like this.

Costing Differences

In this specific case, the histogram produces an estimate that is exactly right. The cost of the plan produced is (correctly) higher than the one that uses the density vector, because it has to process many more rows.

The "lower cost" plan thinks that 14 rows will be produced by that seek, when in fact 11,371 rows are produced.

Logical Reads

Logical reads are slightly higher in the parallel plan because of nested loops prefetching. It doesn't seem to make a big difference on my machine - elapsed time for the queries was within 10 ms of each other.

The parallelism doesn't actually help anything, since all of the rows end up on one thread (on my machine anyway). Adding OPTION (MAXDOP 1) helps with the execution time, but doesn't remove the extra logical reads.

One potential solution to the "extra reads" problem for this query would be to avoid the key lookup entirely by adding PostTypeId as an included column:

CREATE INDEX IX_Posts_OwnerUserId ON dbo.Posts (OwnerUserId) 
INCLUDE (PostTypeId) 
WITH (DROP_EXISTING = ON);