SQL Server – Query That Finds Duplicate Values Based on a Specific Value

Tags: duplication, sql-server, sql-server-2008

I've been searching through forums all morning and haven't been able to make much progress. The system I work with has a particular table that assigns two identifying values to each row, which we'll call ID01 and ID02. ID01 is generated locally by the system, while ID02 is generated by an external interface that connects to our system. This normally works fine, except when two instances of the interface are running at once: each instance generates its own unique ID02, but both rows carry the same ID01, which results in a duplicate entry in the user interface.

My goal is to write a query that can be run against the database that will show me all rows on the given table that have the same ID01, but a different ID02. I started with this, as this was what most of the forum questions I could find marked as the answer to similar questions:

SELECT Count(ID01), ID01, ID02
FROM Table
GROUP BY ID01, ID02
HAVING (COUNT(ID01) > 1)
ORDER BY ID01

The result I get back doesn't work the way I need it to, however – it gives me a list where both ID01 and ID02 are duplicate at the same time, which our system already corrects for automatically. I need it instead to show me every row where ID01 is the same (both the first and duplicate instance or instances), but ID02 is different. If it makes any difference, this is being done in SQL Server 2008.

Any help would be much appreciated.

Best Answer

You can pick out the problematic rows, while ignoring the exact duplicates, by using the ROW_NUMBER() window function:

IF OBJECT_ID('tempdb..#Table') IS NOT NULL
    DROP TABLE #Table;
CREATE TABLE #Table
(
      ID01 INT NOT NULL
    , ID02 INT NOT NULL
);

INSERT INTO #Table (ID01, ID02)
VALUES (1, 1)
     , (1, 2) --problematic
     , (1, 3) --problematic
     , (1, 4) --problematic
     , (2, 1)
     , (2, 1)
     , (2, 2) --problematic
     , (3, 1)
     , (3, 1)
     , (4, 1);

;WITH cte AS (
    SELECT DISTINCT 
          t1_ID01 = t1.ID01
        , t1_ID02 = t1.ID02
        , rn1 = ROW_NUMBER() OVER (PARTITION BY t1.ID01 ORDER BY t1.ID01, t1.ID02)
        , rn2 = ROW_NUMBER() OVER (PARTITION BY t1.ID01, t1.ID02 ORDER BY t1.ID01, t1.ID02)
    FROM #Table t1
    )
SELECT *
FROM cte
WHERE cte.rn1 > 1
    AND cte.rn2 = 1;
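
To see what the WHERE clause is filtering on, it can help to list both row numbers for every row before any filter is applied. A small sketch, assuming the #Table sample data above is still in place:

```sql
-- Assumes #Table was created and populated as shown above.
SELECT t1.ID01
     , t1.ID02
     -- position of the row within its ID01 group
     , rn1 = ROW_NUMBER() OVER (PARTITION BY t1.ID01 ORDER BY t1.ID02)
     -- position of the row among exact (ID01, ID02) repeats
     , rn2 = ROW_NUMBER() OVER (PARTITION BY t1.ID01, t1.ID02 ORDER BY t1.ID02)
FROM #Table t1
ORDER BY t1.ID01, t1.ID02;
```

Rows with rn1 = 1 are the first row seen for their ID01, and rows with rn2 > 1 are repeats of an (ID01, ID02) pair that has already been listed.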

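The first ROW_NUMBER() expression, rn1, numbers the rows within each ID01 value, so rn1 > 1 selects every row after the first one for that ID01. The second, rn2, numbers the rows within each (ID01, ID02) pair, so rn2 = 1 skips the repeats of an exact duplicate pair, which is the case "which our system already corrects for automatically". Note that the first row for each ID01 (the one with the lowest ID02) is treated as the original and is not returned.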
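
If you want every row of an affected ID01 listed, including the first instance as the question asks, one option is to find the ID01 values that have more than one distinct ID02 and then return all rows for those values. This is only a sketch of an alternative, assuming the same #Table sample data; it is not the ROW_NUMBER() approach used in the rest of this answer:

```sql
-- Assumes the #Table sample data created earlier in this answer.
SELECT t.ID01
     , t.ID02
FROM #Table t
WHERE t.ID01 IN (
        SELECT t2.ID01
        FROM #Table t2
        GROUP BY t2.ID01
        HAVING COUNT(DISTINCT t2.ID02) > 1  -- more than one distinct ID02 for this ID01
    )
ORDER BY t.ID01, t.ID02;
```

With the sample data, this lists all rows for ID01 = 1 and ID01 = 2, and leaves out ID01 = 3, where the only repetition is an exact duplicate pair.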

The same ROW_NUMBER() pattern can be used to remove the invalid rows from the source table, by using the DELETE FROM <cte> syntax:

;WITH cte AS (
    SELECT t1.ID01
        , t1.ID02
        , rn1 = ROW_NUMBER() OVER (PARTITION BY t1.ID01 ORDER BY t1.ID01, t1.ID02)
        , rn2 = ROW_NUMBER() OVER (PARTITION BY t1.ID01, t1.ID02 ORDER BY t1.ID01, t1.ID02)
    FROM #Table t1
    )
DELETE 
FROM cte
WHERE cte.rn1 > 1
    AND cte.rn2 = 1;

SELECT *
FROM #Table;

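The output, first from the SELECT and then from the table after the problematic rows have been removed, was shown as a screenshot in the original post. With the sample data above, the SELECT returns the four rows marked as problematic, (1, 2), (1, 3), (1, 4) and (2, 2), and the table afterwards holds the six remaining rows: (1, 1), (2, 1), (2, 1), (3, 1), (3, 1) and (4, 1).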
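
Before running a delete like this against a real table, you may want to rehearse it inside a transaction, so the result can be inspected and undone. A minimal sketch, assuming the same #Table and the same CTE as above; swap ROLLBACK for COMMIT only once the remaining rows look right:

```sql
BEGIN TRANSACTION;

;WITH cte AS (
    SELECT t1.ID01
        , t1.ID02
        , rn1 = ROW_NUMBER() OVER (PARTITION BY t1.ID01 ORDER BY t1.ID01, t1.ID02)
        , rn2 = ROW_NUMBER() OVER (PARTITION BY t1.ID01, t1.ID02 ORDER BY t1.ID01, t1.ID02)
    FROM #Table t1
    )
DELETE
FROM cte
WHERE cte.rn1 > 1
    AND cte.rn2 = 1;

-- Inspect what would remain after the delete.
SELECT *
FROM #Table;

-- Undo the delete; change to COMMIT TRANSACTION once the result is verified.
ROLLBACK TRANSACTION;
```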