SQL Server 2008 R2 restore constantly hanging

restore, sql-server-2008-r2

I have a SQL Server 2008 R2 SP2 test server whose databases I have to refresh fairly often, but every time I run a restore on it, the restore hangs. I use a custom script to check the restore's progress, and it hangs at a different point each time: one time it may reach 56%, the next time 64%, the next time 17%, and so on. When I say it hangs, I mean it simply stops writing. As far as SQL Server is concerned, the restore is still running, so it looks like it is still doing its thing, but nothing happens. If you leave the restore running without killing the process, it will run forever. I have let it run for 12 hours with no progress.
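For anyone who wants to catch this kind of stall automatically, here is a minimal sketch in Python. It is not the custom script mentioned above. It assumes you poll `sys.dm_exec_requests` (which reports `percent_complete` for restore commands) on some interval and collect `(seconds, percent)` samples. The `find_stall` function, the query constant, and the 5-minute threshold are hypothetical choices for illustration.

```python
from typing import List, Optional, Tuple

# T-SQL you could poll to collect samples. sys.dm_exec_requests exposes
# percent_complete for RESTORE commands. How you run it (sqlcmd, a driver,
# an Agent job) is up to you and is not shown here.
PROGRESS_QUERY = """
SELECT session_id, command, percent_complete, wait_type, wait_time
FROM sys.dm_exec_requests
WHERE command LIKE 'RESTORE%';
"""


def find_stall(
    samples: List[Tuple[float, float]],
    stall_seconds: float = 300.0,  # assumed threshold, tune to taste
) -> Optional[Tuple[float, float]]:
    """Return (time progress last advanced, percent at that time) if the
    newest sample is at least stall_seconds past the last advance,
    otherwise None.

    samples: (timestamp_in_seconds, percent_complete) pairs, oldest first.
    """
    if not samples:
        return None
    last_change_time, last_percent = samples[0]
    for ts, pct in samples[1:]:
        if pct > last_percent:
            # Progress moved forward, so remember when and where.
            last_change_time, last_percent = ts, pct
    latest_time = samples[-1][0]
    if latest_time - last_change_time >= stall_seconds:
        return (last_change_time, last_percent)
    return None
```

With samples that reach 56% and then sit there for six minutes, `find_stall` returns the time and percentage of the last advance. With samples that are still climbing it returns `None`.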

Eventually, after retrying over and over, I can get the restore to complete. But that turns what should take only an hour into 5 hours of constant retrying.

So here is some information on the server –

It is a virtual machine:

Windows Server 2008 R2 Standard,
SQL Server 2008 R2 SP2 Standard,
32 GB RAM,
4 processors

Here are the sizes of the problem databases. They are not overly large:

Database A – 71 GB,
Database B – 11 GB,
Database C – 27 GB

Here is what I have done:

I have checked all the logs (SQL Server and Windows) to look for any errors or any messages that looked suspicious. There is nothing.

I have tried to run the restores at nearly every hour of the day, but it does not matter, it will hang at any point.

I have tried dropping the databases, deleting the .mdf and .ldf files, recreating the databases, and then restoring. I still get the issue.

I have traced the restore to see if anything is occurring. Nothing.

I have made sure all connections were killed before starting the restore. I have also placed the databases in single-user mode before starting the restore.

I have no problems restoring the exact same backups on other servers, so the backups themselves are not the issue.

Storage guys looked at the datastore\LUN and checked NAR files and said everything looked good.

They also migrated the vmdk to another LUN, but we still got the issue.

Storage guys sent the NAR files from the Array to EMC for examination and they say no issues with the storage

VMware guys looked at the virtual machine and saw no issue with it.

I also disabled all the antivirus software and tried again. It still hangs.

I have run the restores using the GUI, using the script generated by the GUI, and using custom scripts. Same result.

Ran a repair on SQL Server

Nothing unusual when looking at the wait stats. They show the same thing you would see during a successful restore.

Any thoughts on where else to look, or on possible causes? Are there any traces or counters you can recommend to pinpoint the issue?

My thoughts are an issue with storage or the server, but the Ops guys are adamant there is no issue with either of those.

Also, rebuilding this server at this time is NOT an option. Even though it is Test, it is part of a farm of servers that makes up a Claims environment. We will eventually rebuild this entire environment later this year, but it is a large undertaking.

The next step is getting on the phone with Microsoft, but I am waiting on the Ops guys to take one more look, so I thought I would check here.

Thanks for your help!


Results of the requested wait stats script: [image not available]

Best Answer

I think I have found the issue.

The backups are stored on our Data Domain, so when performing the restore we select the backups from that location. I think there is some type of connection issue between this server and the Data Domain, because when I copy the backup files over to the server and restore from local disk, I cannot reproduce the issue and the restores are probably 10x faster. I set up a continuous restore of one database (11 GB); it has been running for about 25 minutes and has completed successfully 23 times with no hanging.
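If you want to put a number on the difference between the two locations without involving SQL Server at all, a rough sketch like the following could help. It just reads a file sequentially and reports throughput. The function name and both paths in the usage comment are hypothetical, and this is only a sanity check under those assumptions, not a substitute for a real restore test.

```python
import time
from typing import Tuple


def read_throughput(path: str, chunk_size: int = 1024 * 1024) -> Tuple[int, float, float]:
    """Read the file at `path` sequentially in chunk_size pieces.

    Returns (bytes_read, elapsed_seconds, megabytes_per_second).
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 keeps Python from adding its own read buffer on top.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    # Guard against a zero interval on very small files.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, elapsed, (total / (1024 * 1024)) / elapsed


# Usage (hypothetical paths, replace with your own):
#   print(read_throughput(r"\\datadomain\backups\DatabaseB.bak"))
#   print(read_throughput(r"D:\Restore\DatabaseB.bak"))
```

Keep in mind that a second read of the same local file may be served from the OS file cache and look faster than the disk really is, so compare first reads. A read that stalls partway through on the network path would point at the same connection problem without a restore being involved.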

My thinking is that the connection issue probably still lies somewhere with our test server, because we use the Data Domain for all of our production backups (although that is changing) and have no problems restoring elsewhere, including the databases I have been working with. Not to mention, we have automated nightly restores on other servers that have never failed.

I think this at least pinpoints the issue. I will look into where the connection issue is and work with the server\networking guys to see if we can resolve it.

Thank you all for your quick assistance.