SQL Server – OPTION (FORCE ORDER) improves performance until rows are deleted

hints, optimization, performance, query-performance, sql-server, sql-server-2008

I have a somewhat complex SQL Server 2008 query (about 200 lines of fairly dense SQL) that wasn't performing as I needed it to. Over time, performance dropped from about .5 seconds to about 2 seconds.

Taking a look at the execution plan, it was pretty obvious that reordering the joins could improve performance. I did, and it did… down to about .3 seconds. Now the query carries the OPTION (FORCE ORDER) hint, and life is good.
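
For reference, the hint just sits at the end of the statement, roughly like this (the tables here are made up; the real query is far larger):

    -- Illustrative only; the real query is ~200 lines. With OPTION (FORCE ORDER)
    -- the optimizer keeps the join order exactly as written.
    SELECT o.OrderId, ol.LineTotal
    FROM dbo.Orders AS o
    JOIN dbo.OrderLines AS ol
        ON ol.OrderId = o.OrderId
    OPTION (FORCE ORDER);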

Along comes me today, cleaning up the database. I archive about 20% of the rows, taking no action on the relevant tables other than deleting rows… and the execution plan gets TOTALLY hosed. It completely misjudges how many rows certain subtrees will return, and (for example) replaces a:

<Hash>

with

<NestedLoops Optimized='false' WithUnorderedPrefetch='true'>

Now the query time spikes from about .3s to about 18s. (!) Just because I deleted rows. If I remove the query hint, I'm back to about 2s query time: better than 18s, but still far worse than the .3s I had.

I've reproduced the issue after restoring the database to multiple locations and servers. Simply deleting about 20% of rows from each table always causes this issue.

  1. Is it normal for a forced join order to make the query estimates
     completely inaccurate (and thus query times unpredictable)?
  2. Should I just expect that I'll have to
    either accept sub-optimal query performance, or watch it like a hawk
    and frequently manually edit query hints? Or maybe hint every join as well?
    .3s to 2s is a big hit to take.
  3. Is it obvious why the
    optimizer blew up after deleting rows? For example, "yes, it took a sample
    scan, and because I archived most of the rows earlier in the data
    history the sample yielded sparse results, so it underestimated the need for
    a sorted hash operation"?

If you'd like to see execution plans, please suggest a location where I can post them. Otherwise, I've sampled the most stunning bit. Here's the fundamental mis-estimate; the numbers in parentheses are (estimated:actual) rows.

                             /  Clustered Index Scan (908:7229)
Nested Loops (Inner Join) --<
                             \  NonClustered Index Seek (1:7229)

Note the inner loop is expected to scan 908 rows, but instead scans 52,258,441. If the estimate had been accurate, this branch would have run in about 2 ms rather than 12 seconds. Before deleting the rows, this inner join estimate was only off by a total factor of 2, and it was performed as a hash match on two clustered indexes.

Best Answer

Is it normal for a forced join order to make the query estimates completely inaccurate (and thus query times unpredictable)?

The use of FORCE ORDER isn't what made the estimates inaccurate; the deletion of rows did. Forcing an update of statistics on the table may improve the estimation accuracy.
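
As a starting point, something along these lines may help (the table name is a placeholder for whichever table feeds the bad estimates):

    -- Placeholder table name. Check when statistics were last updated,
    -- then refresh them if they pre-date the archiving run.
    SELECT s.name,
           STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM sys.stats AS s
    WHERE s.object_id = OBJECT_ID(N'dbo.ArchivedOrders');

    UPDATE STATISTICS dbo.ArchivedOrders;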

Should I just expect that I'll have to either accept sub-optimal query performance, or watch it like a hawk and frequently manually edit query hints? Or maybe hint every join as well? .3s to 2s is a big hit to take.

Preferable would be to ensure the optimiser is given the information it needs to generate the best plan without using the FORCE ORDER hint. By doing so, it should cope better with changes to the underlying data distribution without requiring manual intervention. That said, if the nature of the data is such that cardinality could vary significantly hour by hour or day by day, consider using a plan guide to ensure the plan is fixed.
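
A rough sketch of a SQL-type plan guide follows; the guide name and statement are placeholders (in reality @stmt would be the full 200-line query text, and it has to match the batch as submitted for the guide to attach):

    -- Placeholder names and statement text.
    EXEC sp_create_plan_guide
        @name            = N'PG_ForceOrder_ArchiveQuery',
        @stmt            = N'SELECT o.OrderId, ol.LineTotal
    FROM dbo.Orders AS o
    JOIN dbo.OrderLines AS ol ON ol.OrderId = o.OrderId;',
        @type            = N'SQL',
        @module_or_batch = NULL,
        @params          = NULL,
        @hints           = N'OPTION (FORCE ORDER)';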

Is it obvious why the optimizer blew up after deleting rows? For example, "yes, it took a sample scan, and because I archived most of the rows earlier in the data history the sample yielded sparse results, so it underestimated the need for a sorted hash operation"?

You didn't mention the row counts in the problem tables, but it's likely that the deletions either:

  • didn't remove enough rows to trigger a statistics update. This should occur when 20% of rows have been modified, but there is the option to use trace flag 2371 to enable a dynamic threshold.
  • did trigger a statistics update, but the sample gathered wasn't representative. Correct this by running a manual update WITH FULLSCAN; a sketch of both fixes follows this list.
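
A sketch of both fixes, with a placeholder table name (confirm that trace flag 2371 is supported on your build before enabling it):

    -- Enable the dynamic auto-update threshold for statistics (instance-wide).
    DBCC TRACEON (2371, -1);

    -- Replace a sampled statistics update with a full scan of the table.
    UPDATE STATISTICS dbo.ArchivedOrders WITH FULLSCAN;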

You could also be running into good old-fashioned parameter sniffing problems, for which there are myriad workarounds. WITH RECOMPILE might be an expensive option to specify on a query this large, but it's worth investigating at both the procedure and statement level.
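
At the statement level that might look like the following; the tables, predicate, and @cutoff parameter are made up, and the hint simply rebuilds the plan for the current parameter values on each execution:

    -- Made-up tables and predicate. Recompiling only the problem statement
    -- avoids paying for WITH RECOMPILE on the entire procedure.
    SELECT o.OrderId, ol.LineTotal
    FROM dbo.Orders AS o
    JOIN dbo.OrderLines AS ol
        ON ol.OrderId = o.OrderId
    WHERE o.ArchivedDate >= @cutoff
    OPTION (RECOMPILE);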