SQL Server – Query that Searches Duplicate Values Based on a Specific Value

duplication, sql-server, sql-server-2008

I've been searching through forums all morning and haven't been able to make much progress. The system I work with has a particular table that assigns two unique values to each row; we'll call them ID01 and ID02. ID01 is generated by the system locally, while ID02 is generated by an external interface that connects to our system. This normally works fine, except when two instances of the interface are running at the same time. In that case, each instance generates its own unique ID02, but both rows get the same ID01, resulting in a duplicate entry in the user interface.

My goal is to write a query that can be run against the database that will show me all rows on the given table that have the same ID01, but a different ID02. I started with this, as this was what most of the forum questions I could find marked as the answer to similar questions:

SELECT Count(ID01), ID01, ID02
FROM Table
GROUP BY ID01, ID02
HAVING (COUNT(ID01) > 1)
ORDER BY ID01

The result I get back doesn't work the way I need it to, however – it gives me a list where both ID01 and ID02 are duplicate at the same time, which our system already corrects for automatically. I need it instead to show me every row where ID01 is the same (both the first and duplicate instance or instances), but ID02 is different. If it makes any difference, this is being done in SQL Server 2008.

Any help would be much appreciated.

Best Answer

You can eliminate the duplicates from your query by using the ROW_NUMBER() window function:

IF OBJECT_ID('tempdb..#Table') IS NOT NULL
    DROP TABLE #Table;
CREATE TABLE #Table
(
      ID01 INT NOT NULL
    , ID02 INT NOT NULL
);

INSERT INTO #Table (ID01, ID02)
VALUES (1, 1)
     , (1, 2) --problematic
     , (1, 3) --problematic
     , (1, 4) --problematic
     , (2, 1)
     , (2, 1)
     , (2, 2) --problematic
     , (3, 1)
     , (3, 1)
     , (4, 1);

;WITH cte AS (
    SELECT DISTINCT 
          t1_ID01 = t1.ID01
        , t1_ID02 = t1.ID02
        , rn1 = ROW_NUMBER() OVER (PARTITION BY t1.ID01 ORDER BY t1.ID01, T1.ID02)
        , rn2 = ROW_NUMBER() OVER (PARTITION BY t1.ID01, t1.ID02 ORDER BY t1.ID01, T1.ID02)
    FROM #Table t1
    )
SELECT *
FROM cte
WHERE cte.rn1 > 1
    AND cte.rn2 = 1;

The first ROW_NUMBER() column, rn1, numbers the rows within each ID01 value, so rn1 > 1 picks out the rows beyond the first one for that ID01. The second ROW_NUMBER() column, rn2, numbers the rows within each (ID01, ID02) pair, so rn2 = 1 skips rows that merely repeat a pair already seen, which is the case "which our system already corrects for automatically".
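The question also asks to see every row for an affected ID01, including the first instance, which the rn1 > 1 filter leaves out. A minimal sketch of one way to list them, assuming the #Table sample data above, is to keep each row for which another row exists with the same ID01 but a different ID02:

```sql
-- Sketch only: lists every row whose ID01 is shared with at least one
-- row carrying a different ID02 (the first instance is included).
SELECT t.ID01
     , t.ID02
FROM #Table t
WHERE EXISTS (
    SELECT 1
    FROM #Table t2
    WHERE t2.ID01 = t.ID01
        AND t2.ID02 <> t.ID02
    )
ORDER BY t.ID01
    , t.ID02;
```

Against the sample data this should return all rows for ID01 = 1 and ID01 = 2, and none for ID01 = 3 or ID01 = 4, because those values only ever appear with a single ID02. The ROW_NUMBER() query above remains the one to build on when the goal is to pick out only the extra rows.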

That ROW_NUMBER() pattern can also be used to remove the invalid rows from the source table, by using the DELETE FROM <cte> syntax:

;WITH cte AS (
    SELECT t1.ID01
        , t1.ID02
        , rn1 = ROW_NUMBER() OVER (PARTITION BY t1.ID01 ORDER BY t1.ID01, T1.ID02)
        , rn2 = ROW_NUMBER() OVER (PARTITION BY t1.ID01, t1.ID02 ORDER BY t1.ID01, T1.ID02)
    FROM #Table t1
    )
DELETE 
FROM cte
WHERE cte.rn1 > 1
    AND cte.rn2 = 1;

SELECT *
FROM #Table;

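The output comes in two parts: first the SELECT, which returns the four rows marked --problematic in the sample data, then the table after those rows have been removed, leaving (1, 1), (2, 1), (2, 1), (3, 1), (3, 1), and (4, 1).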

Related Solutions

MySQL – Need to find duplicate entries

Suppose your table is called ingredients. Try the following:

Step 01) Create an empty delete keys table called ingredients_delete_keys

CREATE TABLE ingredients_delete_keys
SELECT fk,recipe,pkey FROM ingredients WHERE 1=2;

Step 02) Create PRIMARY KEY on ingredients_delete_keys

ALTER TABLE ingredients_delete_keys ADD PRIMARY KEY (fk,recipe,pkey);

Step 03) Index the ingredients table with fk,recipe,pkey

ALTER TABLE ingredients ADD INDEX fk_recipe_pkey_ndx (fk,recipe,pkey);

Step 04) Populate the ingredients_delete_keys table with the pkey to keep for each (fk,recipe) pair

INSERT INTO ingredients_delete_keys
SELECT fk,recipe,MIN(pkey)
FROM ingredients GROUP BY fk,recipe;

Step 05) Perform a DELETE JOIN on ingredients table using keys that don't match

DELETE B.*
FROM ingredients B
LEFT JOIN ingredients_delete_keys A
USING (fk,recipe,pkey)
WHERE A.pkey IS NULL;

Step 06) Drop the delete keys

DROP TABLE ingredients_delete_keys;

Step 07) Get rid of the fk_recipe_pkey_ndx index

ALTER TABLE ingredients DROP INDEX fk_recipe_pkey_ndx;

OK Here are all the lines in one block...

CREATE TABLE ingredients_delete_keys
SELECT fk,recipe,pkey FROM ingredients WHERE 1=2;
ALTER TABLE ingredients_delete_keys ADD PRIMARY KEY (fk,recipe,pkey);
ALTER TABLE ingredients ADD INDEX fk_recipe_pkey_ndx (fk,recipe,pkey);
INSERT INTO ingredients_delete_keys
SELECT fk,recipe,MIN(pkey)
FROM ingredients GROUP BY fk,recipe;
DELETE B.*
FROM ingredients B
LEFT JOIN ingredients_delete_keys A
USING (fk,recipe,pkey)
WHERE A.pkey IS NULL;
DROP TABLE ingredients_delete_keys;
ALTER TABLE ingredients DROP INDEX fk_recipe_pkey_ndx;

Give it a Try !!!

CAVEAT

Notice that using the MIN function keeps the first pkey entered for each (fk,recipe) pair. If you switch it to the MAX function instead, the last pkey entered for each pair is kept.
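After running the steps, it may be worth confirming that exactly one row is left per (fk,recipe) pair. The following is a minimal sketch, assuming the same ingredients table and columns used above. The optional unique index is an addition that is not part of the original steps, and its name, uq_fk_recipe, is hypothetical:

```sql
-- Should return no rows if every (fk,recipe) pair is now unique.
SELECT fk, recipe, COUNT(*) AS copies
FROM ingredients
GROUP BY fk, recipe
HAVING COUNT(*) > 1;

-- Optional (hypothetical index name): keep the duplicates from coming back.
-- This statement fails if duplicate (fk,recipe) pairs still exist.
ALTER TABLE ingredients ADD UNIQUE INDEX uq_fk_recipe (fk,recipe);
```

Only add the unique index if a (fk,recipe) pair really should never repeat in your data.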

SQL Server – MSG 666 Error on Insert Query in Large Indexed Table

The low selectivity issue mentioned by Remus is not sufficient on its own to cause the problem on a table of that size.

The uniqueifier starts at 1 and can go up to 2,147,483,646 before actually overflowing the range.

It also requires the right pattern of repeated deletes and inserts to see the issue.

CREATE TABLE T
(
X SMALLINT,
Y INT IDENTITY PRIMARY KEY NONCLUSTERED
)

CREATE CLUSTERED INDEX IX ON T(X)

INSERT INTO T VALUES (1),(1),(1),(2),(2)

Gives

+---+---+-------------+
| X | Y | Uniqueifier |
+---+---+-------------+
| 1 | 1 |             |
| 1 | 2 |           1 |
| 1 | 3 |           2 |
| 2 | 4 |             |
| 2 | 5 |           1 |
+---+---+-------------+

Then running

DELETE FROM T 
WHERE Y IN (2,3)

INSERT INTO T VALUES (1),(1)

Gives

+---+---+-------------+
| X | Y | Uniqueifier |
+---+---+-------------+
| 1 | 1 |             |
| 1 | 6 |           3 |
| 1 | 7 |           4 |
| 2 | 4 |             |
| 2 | 5 |           1 |
+---+---+-------------+

This shows that, in that case, the uniqueifier did not reuse the values from the deleted rows.

However then running

DELETE FROM T 
WHERE Y IN (6,7)
WAITFOR DELAY '00:00:10'
INSERT INTO T VALUES (1),(1)

Gives

+---+---+-------------+
| X | Y | Uniqueifier |
+---+---+-------------+
| 1 | 1 |             |
| 1 | 8 |           1 |
| 1 | 9 |           2 |
| 2 | 4 |             |
| 2 | 5 |           1 |
+---+---+-------------+

This shows that the high water mark can be reset after deleting the duplicate with the highest uniqueifier value. The delay was to allow the ghost record cleanup process to run.

Because life is too short to insert 2 billion duplicates, I then used DBCC WRITEPAGE to manually adjust the highest uniqueifier to 2,147,483,644.

I then ran

INSERT INTO T VALUES (1)

multiple times. It succeeded twice and failed on the third attempt with error 666.

This was actually one lower than I would have assumed, meaning that the highest uniqueifier inserted was 2,147,483,646 rather than the maximum int value of 2,147,483,647.
