How to Delete Millions of Rows from a SQL Table

delete, performance, query-performance, sql-server

I have to delete 16+ million records from a 221+ million row table and it is going extremely slowly.

I would appreciate any suggestions to make the code below faster:

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

DECLARE @BATCHSIZE INT,
        @ITERATION INT,
        @TOTALROWS INT,
        @MSG VARCHAR(500);
SET DEADLOCK_PRIORITY LOW;
SET @BATCHSIZE = 4500;
SET @ITERATION = 0;
SET @TOTALROWS = 0;

BEGIN TRY
    WHILE @BATCHSIZE > 0
        BEGIN
            -- Open a transaction per batch so every COMMIT below has a matching BEGIN.
            BEGIN TRANSACTION;

            DELETE TOP (@BATCHSIZE) FROM MySourceTable
            OUTPUT DELETED.*
            INTO MyBackupTable
            WHERE NOT EXISTS (
                                 SELECT NULL AS Empty
                                 FROM   dbo.vendor AS v
                                 WHERE  VendorId = v.Id
                             );

            SET @BATCHSIZE = @@ROWCOUNT;
            SET @ITERATION = @ITERATION + 1;
            SET @TOTALROWS = @TOTALROWS + @BATCHSIZE;
            SET @MSG = CAST(GETDATE() AS VARCHAR) + ' Iteration: ' + CAST(@ITERATION AS VARCHAR) + ' Total deletes:' + CAST(@TOTALROWS AS VARCHAR) + ' Next Batch size:' + CAST(@BATCHSIZE AS VARCHAR);             
            PRINT @MSG;
            COMMIT TRANSACTION;
            CHECKPOINT;
        END;
END TRY
BEGIN CATCH
    IF @@ERROR <> 0
       AND @@TRANCOUNT > 0
        BEGIN
            PRINT 'An error occurred. The database update failed.';
            ROLLBACK TRANSACTION;
        END;
END CATCH;
GO

Execution Plan (limited for 2 iterations)

VendorId is the primary key and is non-clustered; the clustered index is not used by this script. There are 5 other non-unique, non-clustered indexes.

The task is to remove vendors that do not exist in another table and back them up into a third table. I have 3 tables: vendors, SpecialVendors and SpecialVendorBackups. I am trying to remove the SpecialVendors rows that do not exist in the Vendors table, keeping a backup of the deleted records in case what I'm doing is wrong and I have to put them back in a week or two.

Best Answer

The execution plan shows that it is reading rows from a nonclustered index in some order, then performing a seek for each outer row read to evaluate the NOT EXISTS.

You are deleting 7.2% of the table: 16,000,000 rows in 3,556 batches of 4,500.

Assuming that the rows that qualify are evenly distributed throughout the index, this means it will delete approximately 1 row in every 13.8.

So iteration 1 will read 62,156 rows and perform that many index seeks before it finds 4,500 to delete.

Iteration 2 will first re-read 57,656 (62,156 - 4,500) rows that, ignoring any concurrent updates, definitely won't qualify (they have already been processed), and then read another 62,156 rows to get 4,500 to delete.

Iteration 3 will read (2 * 57,656) + 62,156 rows, and so on, until finally iteration 3,556 will read (3,555 * 57,656) + 62,156 rows and perform that many seeks.

So the number of index seeks performed across all batches is SUM(1, 2, ..., 3554, 3555) * 57,656 + (3,556 * 62,156).

That is ((3,555 * 3,556 / 2) * 57,656) + (3,556 * 62,156), or 364,652,494,976.
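
The total can be checked directly in T-SQL. This is only the arithmetic from the estimate above, using the same rounded figures; the variable names are made up for illustration, and BIGINT is used because the result does not fit in an INT.

```sql
-- Figures come from the estimate above; the variable names are illustrative only.
DECLARE @Batches        BIGINT = 3556,   -- 16,000,000 / 4,500, rounded up
        @ReadPerBatch   BIGINT = 62156,  -- rows read to find 4,500 deletable rows
        @RereadPerBatch BIGINT = 57656;  -- 62,156 - 4,500 rows that never qualify

-- Batch n re-reads (n - 1) * 57,656 old rows, then reads 62,156 new ones.
SELECT ( ( @Batches - 1 ) * @Batches / 2 ) * @RereadPerBatch
       + @Batches * @ReadPerBatch AS TotalSeeks;  -- 364,652,494,976
```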

I would suggest that you materialise the rows to delete into a temp table first:

INSERT INTO #MyTempTable
SELECT MySourceTable.PK,
       1 + ( ROW_NUMBER() OVER (ORDER BY MySourceTable.PK) / 4500 ) AS BatchNumber
FROM   MySourceTable
WHERE  NOT EXISTS (SELECT *
                   FROM   dbo.vendor AS v
                   WHERE  VendorId = v.Id)

Then change the DELETE to use WHERE PK IN (SELECT PK FROM #MyTempTable WHERE BatchNumber = @BatchNumber). You may still need to include a NOT EXISTS in the DELETE query itself to cater for updates made since the temp table was populated, but this should be much more efficient as it will only need to perform 4,500 seeks per batch.
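
Putting the pieces together, the whole approach might look roughly like the sketch below. It is not tested against the real schema: PK stands in for the actual key column and is assumed to be a single INT, the clustered key on (BatchNumber, PK) is my own choice so that each batch lookup is a range seek, and @MaxBatch is a helper variable that does not appear in the question.

```sql
-- Sketch only. Assumptions: PK is a single INT key column on MySourceTable;
-- the temp table layout and @MaxBatch are illustrative, not from the question.
CREATE TABLE #MyTempTable
(
    PK          INT NOT NULL,
    BatchNumber INT NOT NULL,
    PRIMARY KEY CLUSTERED (BatchNumber, PK)
);

-- Evaluate the expensive NOT EXISTS once, assigning each row to a batch.
INSERT INTO #MyTempTable (PK, BatchNumber)
SELECT MySourceTable.PK,
       1 + ( ROW_NUMBER() OVER (ORDER BY MySourceTable.PK) / 4500 )
FROM   MySourceTable
WHERE  NOT EXISTS (SELECT *
                   FROM   dbo.vendor AS v
                   WHERE  MySourceTable.VendorId = v.Id);

DECLARE @BatchNumber INT = 1,
        @MaxBatch    INT;

SELECT @MaxBatch = MAX(BatchNumber) FROM #MyTempTable;

-- If nothing qualified, @MaxBatch is NULL and the loop body never runs.
WHILE @BatchNumber <= @MaxBatch
BEGIN
    -- Each DELETE commits on its own (autocommit), keeping transactions short.
    DELETE FROM MySourceTable
    OUTPUT DELETED.*
    INTO   MyBackupTable
    WHERE  PK IN (SELECT PK
                  FROM   #MyTempTable
                  WHERE  BatchNumber = @BatchNumber)
           -- Re-check in case a matching vendor appeared after the temp table was filled.
           AND NOT EXISTS (SELECT *
                           FROM   dbo.vendor AS v
                           WHERE  MySourceTable.VendorId = v.Id);

    SET @BatchNumber = @BatchNumber + 1;
END;

DROP TABLE #MyTempTable;
```

Because the loop is driven by batch number rather than by @@ROWCOUNT, a batch in which every row has since gained a matching vendor deletes nothing but does not end the loop early.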

Related Solutions

Sql-server – Oracle GoldenGate add trandata errors

I found out what the problem is: it seems that GoldenGate doesn't work with SQL Express. The server I was connecting to is SQL Express, so I'll need to use the Enterprise Edition.

MySQL – How to Delete Millions of Rows Quickly

As mustaccio states, partitioning the data may help, though that might not be practical as a quick solution and you'd still have to optimise such statements a bit.

MySQL is said to be particularly inefficient with IN clauses. In this case it may be running that inner query once for every row in mytable, which is not going to be efficient. Better, but still far from optimal, it may be running the inner query once and spooling the results into a temporary table on disk, then joining on that.

To avoid IN, you can rearrange actions of the pattern:

DELETE FROM mytable1 WHERE value IN (SELECT key FROM mytable2 WHERE <filtering_condition>)

into

DELETE t1 
FROM   mytable1 t1
INNER JOIN
       mytable2 t2 
ON     t2.key = t1.value
WHERE  <filtering_condition>

(in your case both mytable1 and mytable2 are the same table, and that works just as well)

I'm not sure how this will react to the counting-in-a-variable syntax you have there though (I'm not a MySQL person specifically, and that is not something seen in the other DBs I work with regularly).

If you add an integer identity column (I assume your PK here is (id, time), hence the current integer id is not unique) then simply checking it against modulo 5 may be an acceptable approximation of "delete 80% evenly", like so:

DELETE t1 
FROM   mytable t1
WHERE  t1.time < "2013-08-08 00:00:00"
AND    t1.counter MOD 5 != 0

Adding that column initially will be a time-consuming process, but maintaining it afterwards should not be a problem (the DB will generate a number for you on each insert; just make sure you don't include it in the VALUES list of any INSERT operation), and you don't need to use IN or JOIN at all. An index over (time, counter) instead of just time may help performance a bit more. I would be tempted to make counter the primary key and (id, time) a separate index (as well as an index over time or (time, counter)), but that would depend a lot on your other operations on the data.

Of course, once you are altering the table structure like this, do give consideration to the partitioning option too. It will be more complicated but may have a significant beneficial performance impact elsewhere too.
