Sql-server – Very slow SQL Server database recovery

recoverysql serversql-server-2012

We have a very large DB (6TB) that is taking a long time in recovery, after we rebooted the server.

After doing some research we understand that it is because of the large number of VLFs. Right now, it is a waiting game.

In the SQL Server log, we see that the ETA is 24 hours! This is for the first of the three phases!

We have announced everyone that DB will be available in 24 hours, not knowing there are 2 more phases.

Does anyone know, how long the other 2 phases take? Just a ballbark number would do. By any chance are the next 2 phases shorter than the first one?

Best Answer

Here's what we found:

Even though the SQL LOG claimed an ETA of 24 hours, the DB recovered in 12 hours. So the numbers were grossly over-estimated.

Even though SQL LOG was showing percentage recovery completion, it didn't go all the way up to 100%. At around 27%, SQL log said everything okay and recovery complete.

In our case 2 things caused this issue:

  1. Large number of VLFs (697 in our case, for a 70GB log. Although, 697 doesn't seem that large to me)
  2. In our case 800,000+ transactions were required to roll forward. So that could have contributed significantly too.

The answer to my original question: It seems the phase 1 is the slowest; 2 and 3 went very quickly.

We got an answer from Microsoft. Here are some notes from them:

There are 3 phases of the recovery process

Analysis: In this phase the transaction log is analyzed and records the information about the last checkpoint and creates on Dirty Page Table, this might capture all the dirty pages details.

Redo: It’s a roll forward phase. When this phase completes database becomes online so the database was online

Undo: It’s rollback phase. It holds the list of the active transaction from the analysis phase basically undoing the transactions.

In Summary:
This is normal for SQL Server when there is an unexpected shutdown when there is an execution of long running transaction. A database goes through 3 phases of recovery and then comes online – Enterprise edition allows the database to come online after second phase. This means that there is a rollback running at the moment; need to wait for this to complete. Let the server finish the rollback process if a long-running transaction was terminated. Terminating the server process during the rollback of long-running transaction results in long recovery time.

This might look counteractive; I'm going to post it anyway. If I have to refer in the future, or for folks who might be a victim of this scenario.

It took 13 hours to recover the db. Here are the rough numbers.

Total Recovery time (about 13 hours) = 11 hours analysis + 1 hour redo + 0 hours undo