SQL Server – SQL: How to delete millions of records from a table without disturbing the server

sql server

A piece of software of ours ran out of control and inserted millions/billions of unnecessary records into a table.
How can I delete them without disturbing the server?

This is what I have tried:
– setting the recovery model to simple
– deleting them in batches in a script
– putting the virtual machine on an SSD
It still ran for 3 weeks before I aborted it.

I don't really want to copy the data I want to keep and then delete the table.

Best Answer

You need a way to estimate how long your process will take, and I also recommend a simple mechanism for logging how much progress your code makes. That way you can avoid surprises when it runs for 3 weeks and you weren't expecting it to. So start by getting that estimate. You say "millions/billions" of rows in the question, but there's a big difference between the two. Which is it? You'll also want to design your looping code so that the last batch takes about as long as the first batch; that way you'll get predictable performance. I don't know what kind of criteria you're using to identify bad rows, so I can't give any advice for that.
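As a minimal sketch of such a progress log (the table and column names here are made up, not anything from your system): record how many rows each batch deleted and when it finished, then extrapolate after the first few batches.

    -- Example progress log; dbo.DeleteProgress and its columns are placeholder names.
    CREATE TABLE dbo.DeleteProgress
    (
        BatchId     int IDENTITY(1,1) PRIMARY KEY,
        RowsDeleted int          NOT NULL,
        FinishedAt  datetime2(0) NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- Inside the delete loop, right after each batch:
    --     INSERT INTO dbo.DeleteProgress (RowsDeleted) VALUES (@@ROWCOUNT);

    -- Rough throughput estimate once a few batches have finished:
    SELECT
        AVG(RowsDeleted) AS AvgRowsPerBatch,
        DATEDIFF(SECOND, MIN(FinishedAt), MAX(FinishedAt)) * 1.0
            / NULLIF(COUNT(*) - 1, 0) AS AvgSecondsPerBatch
    FROM dbo.DeleteProgress;

Divide your estimate of the total number of bad rows by AvgRowsPerBatch and multiply by AvgSecondsPerBatch to get a ballpark runtime before you commit to the full run.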

The basic strategy that you want to use for the delete depends on the structure of the table and the percentage of bad rows in it:

  1. If you're deleting the majority of rows from a heap, then realistically you can't design a looping pattern that runs in constant time per batch. The best approach I know of is to delete the top N rows in a loop and stop when a batch deletes fewer than N rows (see the first sketch after this list). Be sure to use the TABLOCK hint when doing the DELETE; otherwise SQL Server will not free up pages from deleted rows, which means the final batch will essentially scan all of the pages left behind by the previous batches.

  2. If you're deleting the majority of rows from a clustered table, then loop over the clustered index in key order so you avoid touching the same row more than once (see the second sketch after this list).

  3. If you're deleting a minority of rows from a heap or a clustered table, consider creating a nonclustered index that identifies the next batch of rows to delete. If possible, make it a filtered index so you don't do more work than necessary when creating it. Avoid looping over the same row more than once.
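For case 1, a minimal sketch of that loop might look like the following; dbo.BigHeap and the IsBad = 1 predicate are placeholders for your table and your criteria, and the batch size is just a starting point to tune.

    DECLARE @BatchSize int = 50000;
    DECLARE @Deleted   int = @BatchSize;

    -- Stop as soon as a batch deletes fewer than @BatchSize rows.
    WHILE @Deleted = @BatchSize
    BEGIN
        DELETE TOP (@BatchSize)
        FROM dbo.BigHeap WITH (TABLOCK)   -- TABLOCK lets SQL Server deallocate emptied pages
        WHERE IsBad = 1;

        SET @Deleted = @@ROWCOUNT;

        -- Optionally log progress here (see the sketch above).
    END;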
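For case 2, one way to walk the clustered index without revisiting rows is to slide a key-range window from the lowest key to the highest. This sketch assumes a numeric clustered key, here called Id, with values above 0; adjust for your actual key.

    DECLARE @BatchSize bigint = 50000;
    DECLARE @LastId    bigint = 0;    -- assumes all key values are > 0
    DECLARE @MaxId     bigint = (SELECT MAX(Id) FROM dbo.BigClustered);

    WHILE @LastId < @MaxId
    BEGIN
        DELETE FROM dbo.BigClustered
        WHERE Id >  @LastId
          AND Id <= @LastId + @BatchSize
          AND IsBad = 1;              -- your criteria for bad rows

        -- Advance the window whether or not anything matched,
        -- so no key range is ever scanned twice.
        SET @LastId = @LastId + @BatchSize;
    END;

If the key has large gaps, the batches will vary in size; the fixed window keeps the sketch simple and still guarantees each range is visited only once.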

There are a few things that might improve the speed of your process:

  • The delete part of the query will run single-threaded, so increasing single-core CPU speed and getting at least 2 dedicated CPUs for your VM will help. If your VM is oversubscribed for resources, that can lead to unpredictable performance. Here I'm assuming that the query which identifies rows to delete is simple enough that it runs entirely in serial; if not, the number of CPUs available to SQL Server may become important.

  • Increasing the disk speed of your data and log files can help. As a general rule of thumb, switching to the simple recovery model does not reduce the amount of data written to the transaction log unless minimal logging is used, and minimal logging cannot be used for DELETE statements. All you gained by switching was not having to take transaction log backups.

  • Disabling all nonclustered indexes before doing the delete will help, since the delete no longer has to maintain them. Rebuild them after all of the bad rows are deleted (a short sketch follows this list).
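As a sketch of that disable/rebuild step (the index and table names are examples only):

    -- Before the delete:
    ALTER INDEX IX_BigTable_SomeColumn ON dbo.BigTable DISABLE;

    -- ... run the batched delete ...

    -- After all of the bad rows are gone:
    ALTER INDEX IX_BigTable_SomeColumn ON dbo.BigTable REBUILD;
    -- Or rebuild every index on the table at once:
    -- ALTER INDEX ALL ON dbo.BigTable REBUILD;

Disable each nonclustered index individually; disabling the clustered index would make the table inaccessible.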

The fastest delete is the one that you don't have to do. You said that you don't want to copy the data. That's fine; I assume you have good reasons for that. But know that you might be passing on a solution that runs in hours as opposed to days or weeks; a rough sketch of that approach follows.
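For completeness, the copy-and-swap approach looks roughly like this; the names are placeholders, and you would still need to recreate indexes, constraints, and permissions on the new table before swapping.

    -- Copy only the rows worth keeping (minimally logged under SIMPLE recovery).
    SELECT *
    INTO   dbo.BigTable_Keep
    FROM   dbo.BigTable
    WHERE  IsBad = 0;

    -- Swap the tables.
    BEGIN TRANSACTION;
        EXEC sp_rename 'dbo.BigTable',      'BigTable_Old';
        EXEC sp_rename 'dbo.BigTable_Keep', 'BigTable';
    COMMIT;

    -- After verifying the new table:
    -- DROP TABLE dbo.BigTable_Old;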