SQL Server – Reasons for Long Database Recovery Times

sql serversql-server-2012

This is cross-posted at SSC, but restore is now going on 6 hours.

Hello,

Overnight I ran restores with NORECOVERY for a 5tb database, along with several other 250GB – 1.3tb databases. I did that because sometimes our 3rd party backup software throws wacky LSN errors if it thinks something timed out that really didn't. Anyway, the restore completed with no errors. This morning I started running the RESTORE WITH RECOVERY commands to bring them online for DBCC checks.

Ran on the first database fine, got stuck on the 5tb database for about an hour, then I ran the commands on the remaining six databases in parallel to the 5tb. All of them recovered very quickly. The 5tb database has been restoring now for about 3 hours.

Any thoughts on:

Why this in particular would be taking much longer than both:

a) how long it usually takes to recover, and
b) much longer than other databases took to recover?

If I should cancel/kill the transaction and try to start over

I can see it doing something in sp_WhoIsActive, i.e. informational columns are incrementing, and the wait info changes (though the wait that it always seems to be on is IO_COMPLETION), though status seems to always be suspended.

While researching I came across one article that suggested high VLF count can cause this, and there are about 5k VLFs in the log file, but I'm on a version of 2012 that is patched for the issue:

Microsoft SQL Server 2012 – 11.0.5532.0 (X64)
Jul 14 2014 15:00:27
Copyright (c) Microsoft Corporation
Developer Edition (64-bit) on Windows NT 6.3 (Build 9600: ) (Hypervisor)

Thanks

EDIT:

This is on my DBA box that I use for offloading DBCC checks, so I'm using a good backup. The growth settings for these databases are controlled by best practices issued by the 3rd party vendor. I'm not allowed to touch them. They're set to percentages, which I disagree with, but it's out of my control. Management isn't open to changing because we charge clients by space used for hosting. I do monthly maintenance on VLFs but they have been this high before and I've run the restore with no issues.

EDIT 2:

Finished after 21 hours!

RESTORE DATABASE successfully processed 0 pages in 77454.736 seconds (0.000 MB/sec).

Best Answer

If this is a PRODUCTION environment, definitely DO NOT CANCEL recovery. This will only prolong your pain. DO NOT RESTART SQL SERVER.

  1. Wait until recovery is finished, no matter how long it takes.
  2. Ensure you have a good backup.
  3. Reduce the Virtual Log File count on databases that have an unreasonably high number of them.
  4. Set autogrowth on the database to a reasonable number, NOT 10MB or 10%! If your log file is currently 50GB or larger, for instance, you might want to consider a growth size of 4GB or 8GB.
  5. Use this script, or some variation thereof, to monitor VLF counts on your databases, and pro-actively manage them so you don't have this problem in future.

This answer has many great points about VLF performance.

Kimberly Tripp has an excellent article about how to get better transaction throughput.

MSDN has a blog post about slow database recovery from high VLF counts. The article is targeted at SAP users, but the information applies equally well to all SQL Server installations.

Linchi Shea has some good points over at sqlblog.com

If this is some non-production environment, then all bets are off. You can kill that sucker as much as you like to find out the cause/solution.