SQL Server – OPTION (FORCE ORDER) improves performance until rows are deleted

hints, optimization, performance, query-performance, sql-server, sql-server-2008

I have a somewhat complex SQL Server 2008 query (about 200 lines of fairly dense SQL) that wasn't performing as I needed it to. Over time, performance dropped from about .5 seconds to about 2 seconds.

Taking a look at the execution plan, it was pretty obvious that reordering the joins could improve performance. I did, and it did… down to about .3 seconds. Now the query carries the OPTION (FORCE ORDER) hint, and life is good.
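
For reference, the hint just sits at the end of the statement, roughly like this (the tables here are made up; the real query is far larger):

    -- Illustrative only; the real query is ~200 lines. With OPTION (FORCE ORDER)
    -- the optimizer keeps the join order exactly as written.
    SELECT o.OrderId, ol.LineTotal
    FROM dbo.Orders AS o
    JOIN dbo.OrderLines AS ol
        ON ol.OrderId = o.OrderId
    OPTION (FORCE ORDER);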

Along comes me today, cleaning up the database. I archive about 20% of the rows, taking no action on the relevant tables other than deleting rows… and the execution plan gets TOTALLY hosed. It completely misjudges how many rows certain subtrees will return, and (for example) replaces a:

<Hash>

with

<NestedLoops Optimized='false' WithUnorderedPrefetch='true'>

Now the query time spikes from about .3s to about 18s. (!) Just because I deleted rows. If I remove the query hint, I'm back to about 2s query time: better than 18s, but still far worse than the .3s I had.

I've reproduced the issue after restoring the database to multiple locations and servers. Simply deleting about 20% of rows from each table always causes this issue.

  1. Is it normal for a forced join order to make the query estimates
     completely inaccurate (and thus query times unpredictable)?
  2. Should I just expect that I'll have to
    either accept sub-optimal query performance, or watch it like a hawk
    and frequently manually edit query hints? Or maybe hint every join as well?
    .3s to 2s is a big hit to take.
  3. Is it obvious why the
    optimizer blew up after deleting rows? For example, "yes, it took a sample
    scan, and because I archived most of the rows earlier in the data
    history the sample yielded sparse results, so it underestimated the need for
    a sorted hash operation"?

If you'd like to see execution plans, please suggest a location where I can post them. Otherwise, I've sampled the most stunning bit. Here's the fundamental mis-estimate; the numbers in parentheses are (estimated:actual) rows.

                             /  Clustered Index Scan (908:7229)
Nested Loops (Inner Join) --<
                             \  NonClustered Index Seek (1:7229)

Note the inner loop is expected to scan 908 rows, but instead scans 52,258,441. If the estimate had been accurate, this branch would have run in about 2 ms rather than 12 seconds. Before deleting the rows, this inner join estimate was only off by a total factor of 2, and it was performed as a hash match on two clustered indexes.

Best Answer

Is it normal for a forced join order to make the query estimates completely inaccurate (and thus query times unpredictable)?

The use of FORCE ORDER isn't what made the estimates inaccurate; the deletion of rows did. Forcing an update of statistics on the table may improve the estimation accuracy.
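
As a starting point, something along these lines may help (the table name is a placeholder for whichever table feeds the bad estimates):

    -- Placeholder table name. Check when statistics were last updated,
    -- then refresh them if they pre-date the archiving run.
    SELECT s.name,
           STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM sys.stats AS s
    WHERE s.object_id = OBJECT_ID(N'dbo.ArchivedOrders');

    UPDATE STATISTICS dbo.ArchivedOrders;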

Should I just expect that I'll have to either accept sub-optimal query performance, or watch it like a hawk and frequently manually edit query hints? Or maybe hint every join as well? .3s to 2s is a big hit to take.

Preferable would be to ensure the optimiser is given the information it needs to generate the best plan without using the FORCE ORDER hint. By doing so, it should cope better with changes to the underlying data distribution without requiring manual intervention. That said, if the nature of the data is such that cardinality could vary significantly hour by hour or day by day, consider using a plan guide to ensure the plan is fixed.
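
A rough sketch of a SQL-type plan guide follows; the guide name and statement are placeholders (in reality @stmt would be the full 200-line query text, and it has to match the batch as submitted for the guide to attach):

    -- Placeholder names and statement text.
    EXEC sp_create_plan_guide
        @name            = N'PG_ForceOrder_ArchiveQuery',
        @stmt            = N'SELECT o.OrderId, ol.LineTotal
    FROM dbo.Orders AS o
    JOIN dbo.OrderLines AS ol ON ol.OrderId = o.OrderId;',
        @type            = N'SQL',
        @module_or_batch = NULL,
        @params          = NULL,
        @hints           = N'OPTION (FORCE ORDER)';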

Is it obvious why the optimizer blew up after deleting rows? For example, "yes, it took a sample scan, and because I archived most of the rows earlier in the data history the sample yielded sparse results, so it underestimated the need for a sorted hash operation"?

You didn't mention the row counts in the problem tables, but it's likely that the deletions either:

  • didn't remove enough rows to trigger a statistics update. This should occur when 20% of rows have been modified, but there is the option to use trace flag 2371 to enable a dynamic threshold.
  • did trigger a statistics update, but the sample gathered wasn't representative. Correct this by running a manual update WITH FULLSCAN; a sketch of both fixes follows this list.
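
A sketch of both fixes, with a placeholder table name (confirm that trace flag 2371 is supported on your build before enabling it):

    -- Enable the dynamic auto-update threshold for statistics (instance-wide).
    DBCC TRACEON (2371, -1);

    -- Replace a sampled statistics update with a full scan of the table.
    UPDATE STATISTICS dbo.ArchivedOrders WITH FULLSCAN;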

You could also be running into good old-fashioned parameter sniffing problems, for which there are myriad workarounds. WITH RECOMPILE might be an expensive option to specify on a query this large, but it's worth investigating at both the procedure and statement level.
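
At the statement level that might look like the following; the tables, predicate, and @cutoff parameter are made up, and the hint simply rebuilds the plan for the current parameter values on each execution:

    -- Made-up tables and predicate. Recompiling only the problem statement
    -- avoids paying for WITH RECOMPILE on the entire procedure.
    SELECT o.OrderId, ol.LineTotal
    FROM dbo.Orders AS o
    JOIN dbo.OrderLines AS ol
        ON ol.OrderId = o.OrderId
    WHERE o.ArchivedDate >= @cutoff
    OPTION (RECOMPILE);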