SQL Server – Slow query in production, bad execution plan or bad index

Tags: execution-plan, performance, query-performance, sql-server, sql-server-2017, t-sql

I've just fixed a production performance issue by dropping an index and recreating it.
I suspect dropping the index also dropped execution plans that used it, and one of them happened to be bad.

Arguments in favor of a bad execution plan:

  • Before dropping the index, I looked up the last update date for the statistics on the given table, and they were up to date.
  • My DBA has put in place Ola Hallengren's index and statistics maintenance solution.
  • The slow query was a select statement executed from sp_executesql with date parameters. Executing the same select statement without sp_executesql was fast, but it also didn't use the same execution plan.
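That last symptom is the classic shape of parameter sniffing. A minimal sketch of the difference, assuming a hypothetical query against the table and date column that appear in the plans (the SELECT itself is illustrative, not the actual production statement):

```sql
-- Parameterized: the plan is compiled for the first (sniffed) parameter
-- values and reused for later calls, even when the date range is very
-- different from the one it was compiled for.
EXEC sp_executesql
    N'SELECT COUNT(*)
      FROM dbo.SubTypeInstanceProcessAdvancements
      WHERE AsOfDate BETWEEN @from AND @to;',
    N'@from date, @to date',
    @from = '2019-08-01', @to = '2019-08-15';

-- Ad hoc: compiled fresh for these exact literals, so the optimizer can
-- pick a different (and possibly better) plan for this specific range.
SELECT COUNT(*)
FROM dbo.SubTypeInstanceProcessAdvancements
WHERE AsOfDate BETWEEN '2019-08-01' AND '2019-08-15';
```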

Arguments against a bad execution plan:

  • Before dropping the index, we went real wild and ran the forbidden DBCC FREEPROCCACHE to clear any bad plans, but this didn't fix or change the performance issue.

Note:

  • The slow query happens to use a table indexed by date. However,
    there are wide differences in the number of records per date. In
    other words, any given date holds anywhere from a few records to
    more than 100k, and the distribution is pretty random.

  • The database is running under compatibility level 140 (SQL Server
    2017)
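Skew like this can be confirmed directly from the statistics histogram. A hedged example, assuming the date column's statistic is the [AsOfDate] statistic that shows up in the plans:

```sql
-- Each histogram step shows how many rows fall on each date key
-- (EQ_ROWS) and between keys (RANGE_ROWS); widely varying values
-- across steps confirm the skew described above.
DBCC SHOW_STATISTICS ('dbo.SubTypeInstanceProcessAdvancements', 'AsOfDate')
    WITH HISTOGRAM;
```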

Was the source of the problem a bad plan or a stale index?
If it was a bad plan, then why didn't DBCC FREEPROCCACHE get rid of it?

Plan before issue was solved

Plan after issue was solved

Edit :

From what I know, this looked very much like a bad plan, but somehow DBCC FREEPROCCACHE didn't work. So I'm left in the dark, confused about the whole situation.

Best Answer

Here's a timeline of the events you described, as well as my commentary on what was the likely cause of your bad plan.

  1. You're alerted of a slow query that was previously fast

Bad times, I empathize!

  2. You go crazy and run "the forbidden dbcc freeproccache" and it doesn't help

Usually, clearing the cache fixes problems like this if the problem is parameter sniffing.
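As an aside, you don't have to nuke the entire cache to rule out a cached plan; you can evict a single plan by its handle. A sketch (the LIKE filter is illustrative):

```sql
-- Find the plan handle for the offending statement.
SELECT cp.plan_handle, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE N'%SubTypeInstanceProcessAdvancements%';

-- Evict only that plan (substitute the handle returned above):
-- DBCC FREEPROCCACHE (<plan_handle>);
```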

  3. You drop and recreate a nonclustered index that's being used by the bad plan, and now your problem is solved

Another reason for bad plans is bad stats. If the problem is bad stats, then clearing the procedure cache will just result in the same bad execution plan being produced again (based on the bad stats).

Creating an index on a column creates new statistics on that column WITH FULLSCAN:

Compute statistics by scanning all rows in the table or indexed view.

These are usually the best stats, because they include the most comprehensive view of the data.

Statistics can be automatically updated over time, as rows in the table are changed / added / deleted. SQL Server will use different "sampling percents" based on the size of the table. From that same docs page:

For most workloads, a full scan is not required, and default sampling is adequate. However, certain workloads that are sensitive to widely varying data distributions may require an increased sample size, or even a full scan.

It sounds like what happened was:

  • an automatic stats update ran for this index, with a low or otherwise problematic sample rate
  • this triggered the creation of a new execution plan, and that plan was bad
  • you re-created the index, which produced good (FULLSCAN) stats again

You mentioned you're using Ola's index and stats maintenance solution. You may want to add a special case for this table / index to update stats with FULLSCAN, as it sounds like the table's data is skewed enough to warrant that.
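For example, a separate job step for just this table could look like the sketch below. This is based on IndexOptimize's documented @Indexes / @UpdateStatistics / @StatisticsSample parameters; verify against the version of the maintenance solution you have installed:

```sql
-- Update all statistics on just this table with a 100 percent sample
-- (@StatisticsSample = 100 corresponds to FULLSCAN), without doing any
-- fragmentation-based index maintenance in this step.
EXECUTE dbo.IndexOptimize
    @Databases = 'BDM',
    @Indexes = 'BDM.dbo.SubTypeInstanceProcessAdvancements',
    @FragmentationLow = NULL,
    @FragmentationMedium = NULL,
    @FragmentationHigh = NULL,
    @UpdateStatistics = 'ALL',
    @StatisticsSample = 100;
```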

If you want to avoid scanning the whole table (I don't know how big it is), you can experiment with different sample percentages (starting at 100, and dropping by 5 or 10 percent until you get the bad plan).
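Done manually, that experiment looks something like this (statistic name taken from the plans; adjust the starting sample rate as needed):

```sql
-- Re-sample at a lower rate, then re-run the slow query and compare plans.
UPDATE STATISTICS dbo.SubTypeInstanceProcessAdvancements ([AsOfDate])
    WITH SAMPLE 90 PERCENT;

-- Once you know the table tolerates nothing less than a full scan,
-- pin it there:
UPDATE STATISTICS dbo.SubTypeInstanceProcessAdvancements ([AsOfDate])
    WITH FULLSCAN;
```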


Looking at the execution plans you provided (thanks!), you can see evidence for my theory. Here's the statistics info from the bad plan:

<StatisticsInfo Database="[BDM]" Schema="[dbo]"
                Table="[SubTypeInstanceProcessAdvancements]"
                Statistics="[AsOfDate]" ModificationCount="288"
                SamplingPercent="23.4265"
                LastUpdate="2019-08-15T22:00:09.91" />

And here's the good plan:

<StatisticsInfo Database="[BDM]" Schema="[dbo]"
                Table="[SubTypeInstanceProcessAdvancements]"
                Statistics="[AsOfDate]" ModificationCount="0"
                SamplingPercent="100"
                LastUpdate="2019-08-16T13:50:10.43" />

The bad plan has a sample percent of 23.4265, while the good plan has stats sampled at 100% (because of the index recreation).
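Going forward, you can watch for this without opening plan XML; the sys.dm_db_stats_properties DMF exposes the same numbers:

```sql
-- Current sample rate and modification counter for every statistic on
-- the table; a low sample_pct after an automatic update is the red flag.
SELECT s.name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       CAST(100.0 * sp.rows_sampled / sp.rows AS decimal(5, 2)) AS sample_pct,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.SubTypeInstanceProcessAdvancements');
```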