When removing duplicate rows, using the below query from this tutorial, how would I go about forcing which of the found duplicates to remove?
DELETE FROM dbo.ATTENDANCE
WHERE AUTOID NOT IN (SELECT MIN(AUTOID) _
FROM dbo.ATTENDANCE
GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
And this works great. The reason why I am using this one is because the only unique id available is one in the IDENTITY
column. And to determine if there are duplicate rows I have to look at a combination of multiple columns.
But if I have a set of duplicate rows how do I / how does SQL Server decide which to remove? And how would I force it to remove all but the one with the highest IDENTITY
value?
EMPLOYEE_ID ATTENDANCE_DATE AUTOID
A001 2011-01-01 1
A001 2011-01-01 2
If I would run the query now it happens to remove the second one, with AUTOID
2. But I am trying to remove all but this one (because this is the one latest added).
Best Answer
You could implement a query using
row_number()
to delete everything but the most recent row. This partitions the data by theemployee_id
and orders it by theautoId
column, then you delete everything that is greater than the first row number:See SQL Fiddle with Demo