Sql-server – Exactly same query and plan but different duration and total reads. I Know parameter sniffing but

availability-groupsenterprise-editionsql serversql server 2014

I know what is parameter sniffing but this seems different, I have this simple query:

select * from A400RDATA

This table has only a cluster index and approximately 25000 rows. The query usually takes 1 second but now It took 20 seconds.

I found its execution plan in the cache and I used sp_BlitzCache (thanks Brent Ozar) for analyze every details -> It's true, the last execution takes 20 seconds.

So I run this query again with a little change, a leading space: "select * from A400RDATA" (note leading space before s of select). The leading space is a change in the query text so SQL Optimizer generates a new query plan.

The two query plans are identical, they are trivial plans and they just use the cluster index, but the total reads and duration are different.
Following the two different BlitzCache results:

enter image description here

Solarwind shows a drop in Page Life Expectancy, and SQL Server instance uses only one core (MAXDOP = 1).
I think MAXDOP and PLE can be the cause of the issue, what do you think ?
How can I be sure of that?
How can I know if the data was taken from disk or RAM?

I checked the logical and physical reads of the 20 seconds execution plan:

  • Execution time: 3
  • Physical Reads: 0
  • Rows Processed: 76000
  • Logical Writes: 0
  • Logical Reads: 3841

No physical reads…….
So PLE isn't the cause (perhaps).
I have read again the plan's details: 3 executions and 20 seconds of total duration, so each execution took 6 seconds but relatively little CPU time:

  • Total Duration (ms) = 20,013.84
  • Total CPU (ms) = 192.57

Can be CPU the problem? Maybe only one CPU is not enough and SQL brings the query in RUNNABLE state (waiting for CPU time).

Best Answer

The two query plans are identical, they are trivial plans and they just use the cluster index, but the total reads and duration are different. Following the two different BlitzCache results:

Total reads in that result set are most likely totalled across multiple executions. See the total rows column is 76,218 but the min, max and avg are identical to the other execution? This suggests the original plan had been executed 3 times so don't look at the total, look at the avg, which is identical.

You need to look at whether these reads are logical or physical because the variance in execution time suggests they're possibly logical and the physical read differences may shed more light on your situation.

Solarwind shows a drop in Page Life Expectancy, and Sql server instance uses only one core (MAXDOP = 1). I think MAXDOP and PLE can be the cause of the issue, what do you think ? How can I be sure of that ? how can I know if the data was taken from disk or ram ?

If you only have a single core then both query executions will use MAXDOP = 1, so that is unlikely to cause any variation. PLE could be a symptom of a deeper issue causing your performance issue, but it's not the root cause. If you're seeing PLE consistently low or has an extreme sawtooth pattern, you may find that determining the root cause of that helps identify this query's problem.

What you're describing doesn't sound like parameter sniffing because your query has no WHERE clause. It sounds like the query is having to read pages from disk on the first execution (physical read) into the buffer cache and subsequent reads are occurring in memory (logical read), which is typically much faster than physical reads.

Take a look at this article for a description of physical vs logical reads. Basically, when you first executed it took 20 seconds and this is caused by reading pages from disk. The next execution wasn't faster due to the trivial change or the new plan (as you stated, the actual plans were identical) but because those data pages were read from memory instead of disk. If you were to flush the buffer cache and re-run the queries again you would likely see the same behaviour - 1st execution = 20 secs, 2nd execution = 1 sec.

You can force this behaviour to occur to see it in real-time. On a non-production server, run the following:

DBCC DROPCLEANBUFFERS

SET STATISTICS IO ON

SELECT * FROM A400RDATA

SET STATISTICS IO OFF

Review the output in the messages pane and note down the Physical Reads and Logical Reads values. Run the below lines again:

SET STATISTICS IO ON

SELECT * FROM A400RDATA

SET STATISTICS IO OFF

Check the output and compare the physical reads and logical reads values again. Physical reads should have dropped significantly, most likely down to 0, and logical reads will be the same or higher, indicating pages are now read from memory without needing to fall back to disk reads. Execution time should be quicker as well.

How do we confirm this in production for queries that have already executed? Well it will be hard to be definitive without having monitoring and metrics collection in place prior, however, if you query sys.dm_exec_query_stats using the plan handle you can get stats on recent executions. Check the last, min and max physical reads. If min is 0 and max is some higher number this indicates some executions are reading from disk. If query execution has returned to 1 sec, last should be 0 or some really low number.

Also check the last, min and max elapsed time columns from that DMF as well to see if there is a variance in executions times for the 20-second execution plan, I suspect there will be. Solarwinds may help you correlate a drop in PLE with the increase in execution times for this query, since a drop in PLE could indicate these data pages aged out/were forced out of the cache and caused the physical reads to increase. Check out this article for useful hints on optimising/managing PLE to maintain your servers performance.