Sql-server – Same Query plan, no new query competition, same hardware, but stored procedure runs slowly

performancequery-performancesql server

We have an SSRS report that is based on about 8 stored procedures which was running fine up until a few days ago. The report used to run in about 2 hours. Last Sunday (September 15 2019) it suddenly started timing out based on the SSRS timeout setting. We changed this to more time and found that it finishes after 6 hours. We're also seeing CPU contention when this report runs that we didn't see previously. What's strange is that I can't seem to find any apparent changes that would account for this:

The SP is using the same query plan it's used for weeks. So, in QueryStore you can see it completing in its usual 2 hours for days and suddenly spike 9/15. Even after a restart, the same plan was picked and it isn't being forced.

The underlying server is virtual, but we didn't change any resources until a few days later when we ended up adding more cores (which helped). Our VM guy isn't seeing any underlying hardware contention and from what I've seen, he's right.

The SSRS report itself wasn't changed. Same subscription, same configuration (other than the timeout setting being changed after the first night of bad performance). Statistics are set to update right before this report kicks off.

I'm pretty stumped here. I hate that for now we've had to settle with just throwing more resources at it. I tried searching for this issue online, but no one else seems to have posted about this specific scenario. Any suggestions? Below is the part of the code that takes up to 6 hours total.

EDIT: Link to Query plan: https://www.brentozar.com/pastetheplan/?id=SJ0XtSWwH

SELECT DISTINCT Object1.Column1
    , Function1(CASE 
            WHEN Object1.Column2 IN (?) AND LEFT(Object1.Column3,?) = ?
                THEN ?
            WHEN Object1.Column2 IN (?) AND LEFT(Object1.Column3,?) = ?
                THEN ?
            ELSE ?
            END) OVER (PARTITION BY Object1.Column1) AS Column4
    , Function1(CASE 
            WHEN Object1.Column2 IN (?) AND LEFT(Object1.Column3,?) = ?
                THEN Function2(Column5,Object2.Column6, Object3.Column7)
            WHEN Object1.Column2 IN (?) AND LEFT(Object1.Column3,?) = ?
                THEN Function2(Column5,Object2.Column6, Object3.Column7) * ?
            ELSE ?
            END) OVER (PARTITION BY Object1.Column1) AS Column8
    , Function1(CASE 
            WHEN Object1.Column2 IN (?) AND LEFT(Object1.Column3,?) = ?
                THEN Function2(Column5,Object2.Column6, Object3.Column7)
            WHEN Object1.Column2 IN (?) AND LEFT(Object1.Column3,?) = ?
                THEN Function2(Column5,Object2.Column6, Object3.Column7) * ?
            ELSE ?
            END) OVER (PARTITION BY Object1.Column1) 
            / NULLIF(Function1(CASE 
            WHEN Object1.Column2 IN (?) AND LEFT(Object1.Column3,?) = ?
                THEN ?
            WHEN Object1.Column2 IN (?) AND LEFT(Object1.Column3,?) = ?
                THEN ?
            ELSE ?
            END) OVER (PARTITION BY Object1.Column1), ?)
            / ? AS Column9
INTO Object4
FROM Object5 Object1 
JOIN Schema1.Object6 Object7 ON Object7.Column1 = Object1.Column1
LEFT JOIN Schema1.Object8 Object9 ON Object9.Column10 = Object1.Column10
    LEFT OUTER JOIN Object10 Object3 ON Object3.Column10 = Object9.Column10
                        AND Object3.Column11 = ?
    JOIN (SELECT Object11.Column10, Function3(Object11.Column12) AS Column13 
        FROM Object12 Object11 (NOLOCK)
        WHERE Object11.Column14 = ?
        GROUP BY Object11.Column10) Object13 ON Object9.Column10 = Object13.Column10

    JOIN Object12 Object2 (NOLOCK) ON Object2.Column10 = Object13.Column10
                        AND Object2.Column12 = Object13.Column13

    LEFT JOIN Database1.Schema1.Object14 Object15 ON Object2.Column15 = Object15.Column16 
                                  AND Object15.Column14 = ?
                              AND Object2.Column17 IN (?)

Best Answer

One possible cause for this sudden increase in duration is that there is another process blocking this query from making progress.

You mentioned that you have query store enabled. I would run a query like this, and look for the point where avg_duration jumps up from 2 hours to 6 hours (after 9/15):

SELECT 
    qsrs.first_execution_time,
    qsrs.avg_duration,
    qsrs.avg_cpu_time,
    qsrs.avg_logical_io_reads,
    qsrs.avg_logical_io_writes,
    qsrs.avg_dop,
    qsrs.avg_rowcount
FROM sys.query_store_runtime_stats qsrs 
WHERE qsrs.plan_id = @your_plan_id
ORDER BY qsrs.first_execution_time DESC;

If the problem is blocking, you should see all of the other values (CPU, reads, writes, etc) stay mostly the same. You should then be able to track this problem down by finding other queries running during this time period (either using Query Store, or running sp_WhoIsActive, or logging it to a table).

If other values in the runtime stats did change, then at least this will be a clue as far as where to look next. For instance, I included avg_dop because maybe the query has suddenly stopped being able to use parallelism as much because of a settings change you didn't know about.


Since you've determined that CPU increased with duration, but the other metrics stayed the same, I'm tempted to think that the VM is fighting to get CPU time on the host, possibly because it's oversubscribed and there are other CPU heavy workloads running on other guests.

Jonathan Kehayias has an interesting post that shows SOS_SCHEDULER_YIELD waits could result from this situation (CPU Ready Impact on SOS_SCHEDULER_YIELD). If you're logging wait stats, you might look to see if you can correlate the increase in duration with increases in this wait type. Or better yet, work with your VM admin to see if there's evidence on the host side: