We've set up log shipping to a secondary SQL Server in Standby/Read-Only mode to offload all SSRS report generation.
This works fine within the restrictions imposed by:
- Kicking users out during the transaction log restore (we got around this by setting up multiple instances and restoring the most recent transaction logs on a round-robin schedule)
- The data being out of date by, at most, the interval of the scheduled transaction log backup/restore job.
Unfortunately, the first time any stored procedure is run after a transaction log restore, it takes much longer than normal to complete. All subsequent executions of that same stored procedure finish within the expected time. If we then execute a different stored procedure, its first run is slow and all subsequent executions complete in the expected time.
For reference, execution takes ~00:02 (mm:ss) normally compared to ~01:00 on the first run.
I assume this has something to do with either server execution statistics or parameter sniffing/cached execution plans.
Is there any way to get around this issue? Or is this inherent to the transaction log restore?
If it were just the very first execution of any single stored procedure, we could get around this easily by executing one upon restore, but it appears to affect the first execution of every stored procedure.
I tried running COUNT(*) on the 11 tables touched by the stored procedure I'm using for testing. The first run took 00:32, and subsequent COUNT(*) runs took 00:00. Unfortunately, this did not have any impact on the first run of the stored procedure.
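One way to check whether those COUNT(*) runs actually left the tables in cache is to look at the buffer pool directly. A sketch (the filter on user objects and the page math are assumptions; run it in the database being tested):

```sql
-- Sketch: count how many pages of each index are currently in the buffer
-- pool for the current database (8 KB per page).
SELECT o.name                AS table_name,
       i.name                AS index_name,
       COUNT(*)              AS cached_pages,
       COUNT(*) * 8 / 1024   AS cached_mb
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
    ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p
    ON p.hobt_id = au.container_id
   AND au.type IN (1, 3)          -- in-row and row-overflow data
JOIN sys.indexes AS i
    ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.objects AS o
    ON o.object_id = p.object_id
WHERE bd.database_id = DB_ID()
  AND o.is_ms_shipped = 0
GROUP BY o.name, i.name
ORDER BY cached_pages DESC;
```

Comparing this output before and after the warm-up queries would show whether the right indexes are actually being cached.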
I don't see any results on either my primary or secondary server for is_temporary stats, either before or after execution of a stored procedure.
I'm currently on SQL Server 2012.
Query Execution Plan:
At first glance the query execution plans appear significantly different; however, after saving each plan and opening the generated .sqlplan files, they are exactly the same. The apparent difference comes from the different versions of SSMS I am using: 2014 on the primary server and 2018 on the secondary. When viewing the execution plan on the secondary, it shows "### of ### (##%)" underneath every node's percentage and time cost – neither those numbers, nor the actual execution plan, change upon further executions.
I also compared client statistics and they are almost exactly the same; the only difference is that the primary server executes with 1.4 seconds of Wait time on server replies, while the secondary takes 81.3 seconds.
I do see a large number of PAGEIOLATCH_SH waits from the first execution, as you predicted:
                      diff after first exec    diff after second exec
waiting_tasks_count   10903                    918
wait_time_ms          411129                   12768
One of the odd things about this situation: except for the round-robin multiple-instance part of the setup, our production SSRS server already reads from a standby/read-only database fed by periodic transaction log restores, and it does not experience these slowdowns on the first execution of a stored procedure. Its users are kicked off every time a transaction log is restored, though – which is the problem the setup above is supposed to resolve.
Best Answer
There are a few possible things going on here; here's a non-exhaustive list.

One possibility is that the log restore effectively clears the buffer pool for that database, so the first run of each procedure has to read all of its data from disk – you'd see PAGEIOLATCH* waits during the initial run if you check wait stats. One thing you could do to mitigate this is to warm the cache after each restore by running queries like

SELECT COUNT(*) FROM dbo.YourTable

against the tables the procedures touch.

Providing us with a "fast" and "slow" example of an execution plan could help us track down exactly which thing is happening.
If you are on SQL Server 2012 or newer, another possibility is that synchronous statistics updates are causing the delay. These "readable secondary stats" get created in TempDB, since the log shipping secondary is read-only. You can read more about that here (the article is about Availability Groups, but the same thing applies in this scenario):

AlwaysOn: Making latest statistics available on Readable Secondary, Read-Only database and Database Snapshot

If this is the problem causing your slowdown, then one solution would be to find those stats and create them in the production database, so that they are up to date and available after the restore. You can look for the temp stats by querying sys.stats for rows where is_temporary is set.
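A query along these lines will list them (a sketch; is_temporary is the sys.stats flag that marks statistics created for a read-only database):

```sql
-- Sketch: list statistics that were auto-created in TempDB because the
-- database is read-only (run this in the log-shipped database).
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name                   AS stats_name,
       s.auto_created,
       s.is_temporary
FROM sys.stats AS s
WHERE s.is_temporary = 1;
```

An empty result (as reported above, on both servers) would rule this explanation out.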
Based on the wait stats you provided, and the fact that the plans are the same, this is pretty conclusively due to the buffer pool being cleared by the log restore.
On a normal run, you get 12,768 ms (almost 13 seconds) of IO waits.
On the first run, you get 411,129 ms (almost 7 minutes) of IO waits.
The SELECT COUNT(*) approach you tried may not have helped due to different indexes being used by the actual procedure vs. the COUNT(*) query. You have a few options here; one is to force the warm-up query to read the same index the procedure uses:

SELECT COUNT(*) FROM dbo.YourTable WITH (INDEX (IX_Index_Being_Used_By_Proc))
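If the index-hinted warm-up works, it could be added as a final step of the restore job. A minimal sketch, assuming placeholder table, index, procedure, and parameter names:

```sql
-- Sketch: warm the cache after each log restore by scanning the exact
-- indexes the report procedures use. All names below are placeholders.
DECLARE @dummy BIGINT;

SELECT @dummy = COUNT_BIG(*)
FROM dbo.YourTable WITH (INDEX (IX_Index_Being_Used_By_Proc));

-- Repeat for each table/index the slow procedures touch, or simply
-- execute the procedures themselves once with representative parameters:
EXEC dbo.YourProcedure @SomeParam = 1;
```

Running the procedures themselves guarantees the warm-up touches exactly the pages they need, at the cost of one slow "sacrificial" run per restore.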