Sql-server – Rollback taking forever

sql server

Issue : I was moving a non clustered index on a 3TB table from one
file-group to another (userfilegroup to index) and it took 11 hours and never completed and i had to kill the process as it was blocking other processes.

Rollback is taking forever. It has been 10 hours since i killed the process and no clue what is happening. Locks are still held and are impacting other processes.

I see the percent_complete as 0% and kill with statusonly also shows 0% after so many hours.

Can you please suggest me if there is a way that i can check if rollback is still doing anything or just got stuck . I am afraid to restart the server as it has to go through the rollback process anyways.

Only positive thing that see is under sp_who2 diskIO moving for this SPID.

****Any help is much appreciated….****

Best Answer

Bad news: rolling work forward in SQL Server is multi-threaded, but rollbacks are single-threaded. If you had lots of cores doing work for 11 hours, it's entirely possible that it could take dozens of hours (or several days) to do the rollback.

Try using sp_BlitzFirst @ExpertMode = 1 (disclaimer: I'm one of the authors of that open source tool), and look at the file stats section. It'll show you which data & log files have had physical reads & writes in that time span. That'll show you if your rollback is continuing to move along.

You mentioned "impacting other processes" - keep in mind that your rollback speed can be reduced by other processes running on the server. You may also want to keep an eye on your drive free space and transaction log sizes. Your transaction was a logged operation, and SQL Server will need that transaction log to stay online even after you've taken log backups. It's very conceivable that you'll run out of drive space during the rollback.

Given how big that rollback was, and depending on your RPO/RTO goals, I'd consider failing over to your disaster recovery servers or restoring from backups. I hate to lecture you now, but this is a good time to learn the lesson that you should try everything in a development environment first to measure how long it'll take. You would have been better prepared to identify that you can't move a multi-terabyte object from one filegroup to another without a lot of planning.