How to Remove Specific Duplicates in SQL Server (All But Latest)

sql server, sql-server-2012

When removing duplicate rows using the query below (from this tutorial), how would I control which of the duplicates gets removed?

DELETE FROM dbo.ATTENDANCE
WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
                     FROM dbo.ATTENDANCE
                     GROUP BY EMPLOYEE_ID, ATTENDANCE_DATE)

This works great. I am using it because the only unique id available is the IDENTITY column, and to determine whether rows are duplicates I have to look at a combination of several columns.

But if I have a set of duplicate rows how do I / how does SQL Server decide which to remove? And how would I force it to remove all but the one with the highest IDENTITY value?

EMPLOYEE_ID     ATTENDANCE_DATE     AUTOID
A001            2011-01-01          1
A001            2011-01-01          2

If I run the query now, it removes the second row, the one with AUTOID 2. But I want to remove all rows except that one (because it is the most recently added).

Best Answer

You could use row_number() to delete everything but the most recent row. The query partitions the data by EMPLOYEE_ID and ATTENDANCE_DATE and numbers the rows within each partition by AUTOID in descending order, so the row with the highest AUTOID gets row number 1. You then delete every row whose row number is greater than 1:

;with cte as
(
  select [EMPLOYEE_ID], [ATTENDANCE_DATE], [AUTOID],
    row_number() over(partition by [EMPLOYEE_ID], [ATTENDANCE_DATE] 
                      order by  [AUTOID] desc) rn
  from dbo.ATTENDANCE
)
delete 
from cte 
where rn > 1;

See SQL Fiddle with Demo
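
Before running the delete, it can help to preview which rows would be removed. The following is only a sketch against the same dbo.ATTENDANCE table and is not part of the original answer. The first statement lists the rows the CTE delete would remove. The second is the question's own query with MIN swapped for MAX, which should have the same effect, assuming AUTOID is never NULL (it is an IDENTITY column).

```sql
-- Preview: rows the delete would remove
-- (every row except the highest AUTOID in each group).
with cte as
(
  select [EMPLOYEE_ID], [ATTENDANCE_DATE], [AUTOID],
    row_number() over(partition by [EMPLOYEE_ID], [ATTENDANCE_DATE]
                      order by [AUTOID] desc) rn
  from dbo.ATTENDANCE
)
select [EMPLOYEE_ID], [ATTENDANCE_DATE], [AUTOID]
from cte
where rn > 1;

-- Alternative: the original query with MIN changed to MAX,
-- so the row with the highest AUTOID in each group is the one kept.
delete from dbo.ATTENDANCE
where [AUTOID] not in (select max([AUTOID])
                       from dbo.ATTENDANCE
                       group by [EMPLOYEE_ID], [ATTENDANCE_DATE]);
```

With the two sample rows from the question, both approaches keep AUTOID 2 and remove AUTOID 1.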

Related Solutions

Sql-server – Inserting rows into other table whilst preserving IDENTITY

While it doesn't automatically prevent duplicates, you can temporarily allow explicit values in the identity column using SET IDENTITY_INSERT, as shown in the following. Afterwards you would likely want to reseed the identity to the highest value in the table:

 SET IDENTITY_INSERT dbo.tablename ON;

 INSERT ...

 SET IDENTITY_INSERT dbo.tablename OFF;

 DECLARE @sql NVARCHAR(MAX);

 SELECT @sql = N'DBCC CHECKIDENT(''dbo.tablename'', RESEED, '
   + RTRIM(MAX(id_column_name)) + ');' FROM dbo.tablename;

 EXEC sp_executesql @sql;

I'm not sure what your best course of action would be to correct duplicates. If you insert 1000 new rows after reseeding, it is likely that the source system will generate new identity values that conflict with them. What you might consider instead is seeding the identity in one of the tables well above any value the other table will ever reach (say, 1 billion). You can still use IDENTITY_INSERT to merge, but there will never be a conflict. This also makes it very easy to determine whether a row was generated locally or imported.
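
The "high range" idea could look roughly like this. It is a sketch only: dbo.tablename is the placeholder name used earlier, and 1000000000 is an arbitrary illustrative seed that assumes the identity column is INT or larger and that the other system stays well below that value.

```sql
-- Move the local identity generator into a range the other system
-- is assumed never to reach. RESEED sets the current identity value,
-- so on a table that already contains rows the next generated value
-- is 1000000000 plus the increment.
DBCC CHECKIDENT('dbo.tablename', RESEED, 1000000000);

-- Report the current identity value without changing it.
DBCC CHECKIDENT('dbo.tablename', NORESEED);
```

Rows merged in afterwards with IDENTITY_INSERT keep their original, lower values, so a row's id range alone shows where it came from.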

SQL Server – How to Add a Unique Constraint Ignoring Existing Violations

The answer is "yes". You can do this with a filtered index.

For instance, you can do:

create unique index t_col on t(col) where id > 1000;

This enforces uniqueness only among new rows (those with id > 1000), not among the old ones. Note that this particular formulation still allows a new row to duplicate a value held by an existing row.

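If you have just a handful of duplicates, you could exclude their ids from the index instead. A filtered index predicate accepts simple comparisons and IN, but not NOT IN, so each id is excluded with its own <> comparison:

create unique index t_col on t(col) where id <> <id of first duplicate> and id <> <id of second duplicate>;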
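
To fill in either filter, the existing duplicates have to be identified first. A sketch, using the same placeholder names t, col, and id as the examples above:

```sql
-- Values of col that occur more than once, with the id range they span.
select col, count(*) as copies, min(id) as lowest_id, max(id) as highest_id
from t
group by col
having count(*) > 1;

-- Highest id currently holding a duplicated value. A cutoff such as
-- "where id > <this value>" leaves all existing duplicates outside the index.
select max(id) as highest_duplicate_id
from t
where col in (select col from t group by col having count(*) > 1);
```

If the second query returns NULL, the table has no duplicates in col and a plain unique index would work.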
