I need to delete duplicate rows from a large table.
what is the best way to achieve that?
currently I use this algorithm:
declare @t table ([key] int )
insert into @t select 1
insert into @t select 1
insert into @t select 1
insert into @t select 2
insert into @t select 2
insert into @t select 3
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 6
insert into @t select 6
insert into @t select 6
insert into @t select 7
insert into @t select 7
insert into @t select 8
insert into @t select 8
insert into @t select 9
insert into @t select 9
insert into @t select 9
insert into @t select 9
insert into @t select 9
select * from @t
; with cte as (
select *
, row_number() over (partition by [Key] order by [Key]) as Picker
from @t
)
delete cte
where Picker > 1
select * from @t
when I run it on my system:
;WITH Customer AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY AccountCode ) AS [Version]
FROM Stage.Customer
)
DELETE
FROM Customer
WHERE [Version] <> 1
I found that <> 1 is better than > 1.
I could create this index, currently not present:
USE [BodenDWH]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [Stage].[Customer] ([AccountCode])
INCLUDE ([ID])
GO
Is there any other way to get this done?
On this occasion this table is not big – about 500,000 records on the live system.
the delete is part of a SSIS package, it runs daily and deletes about 10-15 records a day.
there are problems in the way the data is structured, I just need one AccountCode for each customer but there could be duplicates and if they are not removed, they break the package on a later stage.
It was not me who developed the package, and my scope is not to re-design anything.
I am just after the best way to get rid of the duplicates, in the quickest possible way, without having to refer to index creation, or anything, just the T-SQL code.
Best Answer
If the table is small and the number of rows you are deleting is small, then use
Note: In above query you are using an arbitrary ordering in the window order clause
ORDER BY (select null)
(learned it from Itzik Ben-Gan's T-SQL Querying book and @AaronBertrand cited that above as well).If the table is large (e.g. 5M records) then deleting in small number of rows or chunks will help not bloat transaction log and will prevent lock escalation.