SQL Server – Tuning a huge delete operation on a SQL Server table


I am performing a delete operation on a very large SQL Server table, using the query below.

DELETE db.st_table_1
WHERE service_date BETWEEN (SELECT MIN(service_date) FROM stg_table)
                       AND (SELECT MAX(service_date) FROM stg_table);

Neither stg_table nor stg_table_1 has an index on service_date.

Both of these tables are loaded with millions of rows of data, and the delete operation is taking a long time. I would appreciate suggestions for improving the performance of this query.

I referred to the strategy described in the question below, but couldn't understand how to implement it:

How to delete large amount of data in sql server without data loss?

Any suggestions would be much appreciated.

Update:

SELECT * INTO db.temp_stg_table_1
FROM db.stg_table_1
WHERE service_date NOT BETWEEN (SELECT MIN(service_date) FROM db.stg_table)
                           AND (SELECT MAX(service_date) FROM db.stg_table);

EXEC sp_rename 'stg_table_1', 'stg_table_1_old';

EXEC sp_rename 'temp_stg_table_1', 'stg_table_1';

DROP TABLE stg_table_1_old;

How about going with the above logic to delete the millions of records? What are the advantages and disadvantages of that approach?

Best Answer

Testing based on your comments

Tested on SQL Server 2014 SP3

Neither stg_table nor stg_table_1 has an index on service_date.

Both of these tables are loaded with millions of rows of data, and the delete operation is taking a long time.

DDL

CREATE TABLE dbo.st_table_1 (stg_table_1_ID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
                             service_date   datetime2,
                             val            int);

CREATE TABLE dbo.stg_table  (stg_table_ID   INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
                             service_date   datetime2,
                             val            int);

Primary keys + clustered indexes on the identity fields.

DML

INSERT INTO dbo.stg_table WITH (TABLOCK) (service_date, val)
SELECT -- 1M rows
    DATEADD(S, rownum, GETDATE()), rownum
FROM (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rownum
      FROM master.dbo.spt_values spt1
      CROSS APPLY master.dbo.spt_values spt2) AS sptvalues;

INSERT INTO dbo.st_table_1 WITH (TABLOCK) (service_date, val)
SELECT -- 2.5M rows
    DATEADD(S, rownum, GETDATE()), rownum
FROM (SELECT TOP (2500000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rownum
      FROM master.dbo.spt_values spt1
      CROSS APPLY master.dbo.spt_values spt2) AS sptvalues;

INSERT INTO dbo.stg_table WITH (TABLOCK) (service_date, val)
SELECT -- 4M rows
    DATEADD(S, rownum, GETDATE()), rownum
FROM (SELECT TOP (4000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rownum
      FROM master.dbo.spt_values spt1
      CROSS APPLY master.dbo.spt_values spt2) AS sptvalues;

That gives 2.5M rows in dbo.st_table_1 and 5M rows in dbo.stg_table, and (almost) all of those 2.5M rows will be deleted by the query. This is still more than 10 times less data than yours.

Running your query

The actual execution plan for your base delete statement

As expected, dbo.stg_table is accessed twice to get the MAX & MIN values with a Stream Aggregate. The CPU time & elapsed time:

  CPU time = 4906 ms,  elapsed time = 4919 ms.

A missing-index suggestion is added to the execution plan:

CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[st_table_1] ([service_date])
INCLUDE ([stg_table_1_ID])

However, when we add the index, an extra Sort appears in the plan to delete the rows from this newly added index:

[execution plan screenshot]

The plan

And the CPU time / elapsed time increase:

   CPU time = 11156 ms,  elapsed time = 11332 ms.

YMMV, but in my example, based on your comments about the data, it did not improve the query.

Creating an index on [dbo].[stg_table]

CREATE NONCLUSTERED INDEX IX_service_date
ON [dbo].[stg_table] ([service_date]);

As a result, the MAX() and MIN() can leverage the newly created index, each returning only one row instead of requiring a full clustered index scan:

[execution plan screenshot]

With the execution time improved:

 SQL Server Execution Times:
   CPU time = 2609 ms,  elapsed time = 4028 ms.

And the execution plan
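
For intuition, once IX_service_date exists, each aggregate collapses to a plan shape that is conceptually equivalent to a TOP (1) read off the sorted index. A simplified illustration (these standalone queries are mine, not part of the plan above):

-- Conceptual equivalents of MIN()/MAX() once IX_service_date exists;
-- SQL Server produces a similar TOP + ordered index read internally.
SELECT TOP (1) service_date FROM dbo.stg_table ORDER BY service_date ASC;  -- MIN(service_date)
SELECT TOP (1) service_date FROM dbo.stg_table ORDER BY service_date DESC; -- MAX(service_date)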

But this is only based on indexing & my own example. Proceed at your own risk.


Extra Notes

You should look into splitting that delete into separate batches, so it doesn't fill up the log file and you don't end up with one big all-or-nothing delete; see the sketch below.
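
A minimal sketch of such a batched delete, using the tables above (the batch size of 50,000 and the CHECKPOINT are assumptions you would tune for your environment and recovery model):

-- Compute the boundary dates once so each batch doesn't re-aggregate stg_table.
DECLARE @min_date datetime2,
        @max_date datetime2,
        @batch    int = 50000; -- assumed batch size; tune to your log throughput

SELECT @min_date = MIN(service_date),
       @max_date = MAX(service_date)
FROM dbo.stg_table;

WHILE 1 = 1
BEGIN
    -- Delete one batch of qualifying rows.
    DELETE TOP (@batch) dbo.st_table_1
    WHERE service_date BETWEEN @min_date AND @max_date;

    IF @@ROWCOUNT = 0 BREAK; -- no qualifying rows left

    -- Under the SIMPLE recovery model a CHECKPOINT lets the log truncate
    -- between batches; under FULL you would take log backups instead.
    CHECKPOINT;
END;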

You could also consider using WITH (TABLOCK) so the entire table is locked from the very beginning:

SET STATISTICS IO, TIME ON;

DELETE dbo.st_table_1 WITH (TABLOCK)
WHERE service_date BETWEEN (SELECT MIN(service_date) FROM stg_table)
                       AND (SELECT MAX(service_date) FROM stg_table);

Update: SELECT INTO + sp_rename

SELECT * INTO db.temp_stg_table_1
FROM db.stg_table_1
WHERE service_date NOT BETWEEN (SELECT MIN(service_date) FROM db.stg_table)
                           AND (SELECT MAX(service_date) FROM db.stg_table);

EXEC sp_rename 'stg_table_1', 'stg_table_1_old';

EXEC sp_rename 'temp_stg_table_1', 'stg_table_1';

DROP TABLE stg_table_1_old;

How about going with the above logic to delete the millions of records? What are the advantages and disadvantages of that approach?

Apart from performance, sp_rename needs a Sch-M lock to complete, meaning it has to wait for all other sessions to release their locks on the table before the rename can happen. Also, any indexes/constraints on the original table will be gone, and you will have to recreate them; see the sketch below.
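
If you do go that route anyway, the swap plus the re-creation step might look like the following (the constraint name and the transaction around the renames are my assumptions, not something I tested above):

-- SELECT INTO copies no indexes or constraints, so recreate them after the swap.
BEGIN TRAN;
    EXEC sp_rename 'dbo.st_table_1', 'st_table_1_old';
    EXEC sp_rename 'dbo.temp_stg_table_1', 'st_table_1';
COMMIT;

-- Hypothetical constraint name; recreate whatever the original table had.
ALTER TABLE dbo.st_table_1
    ADD CONSTRAINT PK_st_table_1 PRIMARY KEY CLUSTERED (stg_table_1_ID);

DROP TABLE dbo.st_table_1_old;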

When I run the query on my own data:

SELECT * INTO dbo.temp_stg_table_1
FROM dbo.st_table_1
WHERE service_date NOT BETWEEN (SELECT MIN(service_date) FROM dbo.stg_table)
                           AND (SELECT MAX(service_date) FROM dbo.stg_table);

Keep in mind that this does not represent your data.

It reads all rows to return 0 rows, which is not optimal.

[execution plan screenshot]

With a high execution time:

 SQL Server Execution Times:
   CPU time = 27717 ms,  elapsed time = 10657 ms.

But this is not really meaningful without more information regarding your data. An actual query plan would be needed to give more accurate advice.