SQL Server – Slow query in production, bad execution plan or bad index

Tags: execution-plan, performance, query-performance, sql-server, sql-server-2017, t-sql

I've just fixed a production performance issue by dropping an index and recreating it.
I suspect dropping the index also dropped execution plans that used it, and one of them happened to be bad.

Arguments in favor of a bad execution plan:

  • Before dropping the index, I looked up the last update date for the statistics on the given table, and they were up to date.
  • My DBA has put in place Ola Hallengren's index and statistics maintenance solution.
  • The slow query was a select statement executed from sp_executesql with date parameters. Executing the same select statement without sp_executesql was fast, but it also didn't use the same execution plan.
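That last symptom is the classic shape of parameter sniffing. A minimal sketch of the difference, assuming a hypothetical query against the table and date column that appear in the plans (the SELECT itself is illustrative, not the actual production statement):

```sql
-- Parameterized: the plan is compiled for the first (sniffed) parameter
-- values and reused for later calls, even when the date range is very
-- different from the one it was compiled for.
EXEC sp_executesql
    N'SELECT COUNT(*)
      FROM dbo.SubTypeInstanceProcessAdvancements
      WHERE AsOfDate BETWEEN @from AND @to;',
    N'@from date, @to date',
    @from = '2019-08-01', @to = '2019-08-15';

-- Ad hoc: compiled fresh for these exact literals, so the optimizer can
-- pick a different (and possibly better) plan for this specific range.
SELECT COUNT(*)
FROM dbo.SubTypeInstanceProcessAdvancements
WHERE AsOfDate BETWEEN '2019-08-01' AND '2019-08-15';
```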

Arguments against a bad execution plan:

  • Before dropping the index, we went real wild and ran the forbidden DBCC FREEPROCCACHE to clear any bad plans, but this didn't fix or change the performance issue.

Note:

  • The slow query happens to use a table indexed by date. However,
    there are wide differences in the number of records per date. In
    other words, any given date holds anywhere from a few records to
    more than 100k, and the distribution is pretty random.

  • The database is running under compatibility level 140 (SQL Server
    2017)
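Skew like this can be confirmed directly from the statistics histogram. A hedged example, assuming the date column's statistic is the [AsOfDate] statistic that shows up in the plans:

```sql
-- Each histogram step shows how many rows fall on each date key
-- (EQ_ROWS) and between keys (RANGE_ROWS); widely varying values
-- across steps confirm the skew described above.
DBCC SHOW_STATISTICS ('dbo.SubTypeInstanceProcessAdvancements', 'AsOfDate')
    WITH HISTOGRAM;
```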

Was the source of the problem a bad plan or a stale index?
If it was a bad plan, then why didn't DBCC FREEPROCCACHE get rid of it?

Plan before issue was solved

Plan after issue was solved

Edit :

From what I know, this looked very much like a bad plan, but somehow DBCC FREEPROCCACHE didn't work. So I'm left in the dark, confused about the whole situation.

Best Answer

Here's a timeline of the events you described, as well as my commentary on what was the likely cause of your bad plan.

  1. You're alerted of a slow query that was previously fast

Bad times, I empathize!

  2. You go crazy and run "the forbidden dbcc freeproccache" and it doesn't help

Usually, clearing the cache fixes problems like this if the problem is parameter sniffing.
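As an aside, you don't have to nuke the entire cache to rule out a cached plan; you can evict a single plan by its handle. A sketch (the LIKE filter is illustrative):

```sql
-- Find the plan handle for the offending statement.
SELECT cp.plan_handle, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE N'%SubTypeInstanceProcessAdvancements%';

-- Evict only that plan (substitute the handle returned above):
-- DBCC FREEPROCCACHE (<plan_handle>);
```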

  3. You drop and recreate a nonclustered index that's being used by the bad plan, and now your problem is solved

Another reason for bad plans is bad stats. If the problem is bad stats, then clearing the procedure cache will just result in the same bad execution plan being produced again (based on the bad stats).

Creating an index on a column creates new statistics on that column WITH FULLSCAN:

Compute statistics by scanning all rows in the table or indexed view.

These are usually the best stats, because they include the most comprehensive view of the data.

Statistics can be automatically updated over time, as rows in the table are changed / added / deleted. SQL Server will use different "sampling percents" based on the size of the table. From that same docs page:

For most workloads, a full scan is not required, and default sampling is adequate. However, certain workloads that are sensitive to widely varying data distributions may require an increased sample size, or even a full scan.

It sounds like what happened was:

  • an automatic stats update ran for this index, with a low or otherwise problematic sample rate
  • this triggered the creation of a new execution plan, and that plan was bad
  • you re-created the index, which produced good (FULLSCAN) stats again

You mentioned you're using Ola's index and stats maintenance solution. You may want to add a special case for this table / index to update stats with FULLSCAN, as it sounds like the table's data is skewed enough to warrant that.
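For example, a separate job step for just this table could look like the sketch below. This is based on IndexOptimize's documented @Indexes / @UpdateStatistics / @StatisticsSample parameters; verify against the version of the maintenance solution you have installed:

```sql
-- Update all statistics on just this table with a 100 percent sample
-- (@StatisticsSample = 100 corresponds to FULLSCAN), without doing any
-- fragmentation-based index maintenance in this step.
EXECUTE dbo.IndexOptimize
    @Databases = 'BDM',
    @Indexes = 'BDM.dbo.SubTypeInstanceProcessAdvancements',
    @FragmentationLow = NULL,
    @FragmentationMedium = NULL,
    @FragmentationHigh = NULL,
    @UpdateStatistics = 'ALL',
    @StatisticsSample = 100;
```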

If you want to avoid scanning the whole table (I don't know how big it is), you can experiment with different sample percentages (starting at 100, and dropping by 5 or 10 percent until you get the bad plan).
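Done manually, that experiment looks something like this (statistic name taken from the plans; adjust the starting sample rate as needed):

```sql
-- Re-sample at a lower rate, then re-run the slow query and compare plans.
UPDATE STATISTICS dbo.SubTypeInstanceProcessAdvancements ([AsOfDate])
    WITH SAMPLE 90 PERCENT;

-- Once you know the table tolerates nothing less than a full scan,
-- pin it there:
UPDATE STATISTICS dbo.SubTypeInstanceProcessAdvancements ([AsOfDate])
    WITH FULLSCAN;
```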


Looking at the execution plans you provided (thanks!), you can see evidence for my theory. Here's the statistics info from the bad plan:

<StatisticsInfo Database="[BDM]" Schema="[dbo]"
                Table="[SubTypeInstanceProcessAdvancements]"
                Statistics="[AsOfDate]" ModificationCount="288"
                SamplingPercent="23.4265"
                LastUpdate="2019-08-15T22:00:09.91" />

And here's the good plan:

<StatisticsInfo Database="[BDM]" Schema="[dbo]"
                Table="[SubTypeInstanceProcessAdvancements]"
                Statistics="[AsOfDate]" ModificationCount="0"
                SamplingPercent="100"
                LastUpdate="2019-08-16T13:50:10.43" />

The bad plan has a sample percent of 23.4265, while the good plan has stats sampled at 100% (because of the index recreation).
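Going forward, you can watch for this without opening plan XML; the sys.dm_db_stats_properties DMF exposes the same numbers:

```sql
-- Current sample rate and modification counter for every statistic on
-- the table; a low sample_pct after an automatic update is the red flag.
SELECT s.name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       CAST(100.0 * sp.rows_sampled / sp.rows AS decimal(5, 2)) AS sample_pct,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.SubTypeInstanceProcessAdvancements');
```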