How to Remove Specific Duplicates in SQL Server (All But Latest)

sql serversql-server-2012

When removing duplicate rows, using the below query from this tutorial, how would I go about forcing which of the found duplicates to remove?

DELETE FROM dbo.ATTENDANCE 
    WHERE AUTOID NOT IN (SELECT MIN(AUTOID) _
FROM dbo.ATTENDANCE 
    GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE) 

And this works great. The reason why I am using this one is because the only unique id available is one in the IDENTITY column. And to determine if there are duplicate rows I have to look at a combination of multiple columns.

But if I have a set of duplicate rows how do I / how does SQL Server decide which to remove? And how would I force it to remove all but the one with the highest IDENTITY value?

EMPLOYEE_ID     ATTENDANCE_DATE     AUTOID
A001            2011-01-01          1
A001            2011-01-01          2

If I would run the query now it happens to remove the second one, with AUTOID 2. But I am trying to remove all but this one (because this is the one latest added).

Best Answer

You could implement a query using row_number() to delete everything but the most recent row. This partitions the data by the employee_id and orders it by the autoId column, then you delete everything that is greater than the first row number:

;with cte as
(
  select [EMPLOYEE_ID], [ATTENDANCE_DATE], [AUTOID],
    row_number() over(partition by [EMPLOYEE_ID], [ATTENDANCE_DATE] 
                      order by  [AUTOID] desc) rn
  from dbo.ATTENDANCE
)
delete 
from cte 
where rn > 1;

See SQL Fiddle with Demo