SQL Server – Quickest Way to Delete Duplicate Rows

Tags: delete, duplication, sql-server, sql-server-2014

I need to delete duplicate rows from a large table. What is the best way to achieve that?

Currently I use this algorithm:

declare @t table ([key] int)

insert into @t ([key]) values
    (1), (1), (1),
    (2), (2),
    (3),
    (4), (4), (4), (4), (4),
    (5), (5), (5), (5), (5),
    (6), (6), (6),
    (7), (7),
    (8), (8),
    (9), (9), (9), (9), (9)


select * from @t

; with cte as (
    select *
        , row_number() over (partition by [Key] order by [Key]) as Picker
    from @t
    )
delete cte 
where Picker > 1

select * from @t

When I run it on my system:

;WITH Customer AS
    (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY AccountCode ) AS [Version]
    FROM Stage.Customer
    )
    DELETE
    FROM    Customer
    WHERE   [Version] <> 1


I found that <> 1 is better than > 1.

I could create this index, which is currently not present:

USE [BodenDWH]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [Stage].[Customer] ([AccountCode])
INCLUDE ([ID])
GO


Is there any other way to get this done?

On this occasion the table is not big: about 500,000 records on the live system.

The delete is part of an SSIS package that runs daily and deletes about 10-15 records a day.

There are problems in the way the data is structured. I just need one AccountCode for each customer, but duplicates can appear, and if they are not removed they break the package at a later stage.

I did not develop the package, and redesigning it is not within my scope.

I am just after the quickest way to get rid of the duplicates using plain T-SQL, without resorting to index creation or anything else.
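
For reference, this is the kind of query that shows which AccountCode values are duplicated (illustrative only, it is not part of the package):

SELECT  AccountCode,
        COUNT(*) AS Occurrences
FROM    Stage.Customer
GROUP BY AccountCode
HAVING  COUNT(*) > 1;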

Best Answer

If the table is small and the number of rows you are deleting is small, then use:

;WITH Customer AS
    (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
    FROM dbo.Customer
    )
    DELETE
    FROM    Customer
    WHERE   [Version] > 1;

Note: the above query uses an arbitrary ordering in the window order clause, ORDER BY (SELECT NULL) (I learned this from Itzik Ben-Gan's T-SQL Querying book, and @AaronBertrand cited it above as well).
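
If it matters which duplicate survives, you can swap the arbitrary ordering for a deterministic tiebreaker. A sketch, assuming the table has an ID column (the missing-index script in the question includes one), that keeps the row with the lowest ID per AccountCode:

;WITH Customer AS
    (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY ID) AS [Version] -- lowest ID survives
    FROM dbo.Customer
    )
    DELETE
    FROM    Customer
    WHERE   [Version] > 1;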

If the table is large (e.g. 5M records), then deleting the rows in small chunks will keep the transaction log from bloating and will prevent lock escalation.

A lock escalation will occur if and only if a Transact-SQL statement has acquired at least 5000 locks on a single reference of a table.

while 1 = 1
begin
    ;WITH Customer AS
        (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
        FROM dbo.Customer
        )
    DELETE TOP (4000) -- choose a batch size lower than 5000 to prevent lock escalation
    FROM    Customer
    WHERE   [Version] > 1;

    if @@ROWCOUNT < 4000
        BREAK;
end
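
If the loop has to run while other work is hitting the table, a short pause between batches gives concurrent sessions (and, under the FULL recovery model, log backups) some breathing room. The same loop with an optional delay; the one-second value is an arbitrary choice:

while 1 = 1
begin
    ;WITH Customer AS
        (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
        FROM dbo.Customer
        )
    DELETE TOP (4000)
    FROM    Customer
    WHERE   [Version] > 1;

    if @@ROWCOUNT < 4000
        BREAK;

    WAITFOR DELAY '00:00:01'; -- arbitrary pause between batches
end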