Suppose your table is called ingredients. Try the following:
Step 01) Create an empty delete keys table called ingredients_delete_keys
CREATE TABLE ingredients_delete_keys
SELECT fk,recipe,pkey FROM ingredients WHERE 1=2;
Step 02) Create PRIMARY KEY on ingredients_delete_keys
ALTER TABLE ingredients_delete_keys ADD PRIMARY KEY (fk,recipe,pkey);
Step 03) Index the ingredients table with (fk,recipe,pkey)
ALTER TABLE ingredients ADD INDEX fk_recipe_pkey_ndx (fk,recipe,pkey);
Step 04) Populate the ingredients_delete_keys table
INSERT INTO ingredients_delete_keys
SELECT fk,recipe,MIN(pkey)
FROM ingredients GROUP BY fk,recipe;
Step 05) Perform a DELETE JOIN on the ingredients table, removing rows whose keys don't match the delete keys table
DELETE B.*
FROM ingredients B
LEFT JOIN ingredients_delete_keys A
USING (fk,recipe,pkey)
WHERE A.pkey IS NULL;
Step 06) Drop the delete keys
DROP TABLE ingredients_delete_keys;
Step 07) Drop the fk_recipe_pkey_ndx index
ALTER TABLE ingredients DROP INDEX fk_recipe_pkey_ndx;
OK, here are all the lines in one block...
CREATE TABLE ingredients_delete_keys
SELECT fk,recipe,pkey FROM ingredients WHERE 1=2;
ALTER TABLE ingredients_delete_keys ADD PRIMARY KEY (fk,recipe,pkey);
ALTER TABLE ingredients ADD INDEX fk_recipe_pkey_ndx (fk,recipe,pkey);
INSERT INTO ingredients_delete_keys
SELECT fk,recipe,MIN(pkey)
FROM ingredients GROUP BY fk,recipe;
DELETE B.*
FROM ingredients B
LEFT JOIN ingredients_delete_keys A
USING (fk,recipe,pkey)
WHERE A.pkey IS NULL;
DROP TABLE ingredients_delete_keys;
ALTER TABLE ingredients DROP INDEX fk_recipe_pkey_ndx;
Give it a Try !!!
CAVEAT
Notice that using the MIN function keeps the first pkey entered for each (fk,recipe) pair. If you switch to the MAX function instead, the last pkey entered is kept.
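The whole flow above can be sketched end-to-end in SQLite via Python's sqlite3 (the table and column names follow the steps; the sample rows are invented, and SQLite has no multi-table DELETE ... JOIN, so an anti-join subquery stands in for Step 05):

```python
import sqlite3

# Schema assumed from the steps above: (fk, recipe, pkey) with pkey unique
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE ingredients (fk INT, recipe TEXT, pkey INTEGER PRIMARY KEY)")
# pkey 1-3 share (1,'soup') and pkey 4-5 share (2,'stew'): invented duplicates
cur.executemany("INSERT INTO ingredients (fk, recipe) VALUES (?, ?)",
                [(1, "soup"), (1, "soup"), (1, "soup"), (2, "stew"), (2, "stew")])
# Steps 01/04: the delete-keys table holds the MIN(pkey) per group, i.e. the keepers
cur.execute("""CREATE TABLE ingredients_delete_keys AS
               SELECT fk, recipe, MIN(pkey) AS pkey
               FROM ingredients GROUP BY fk, recipe""")
# Step 05 equivalent: delete every row whose pkey is not a keeper
cur.execute("""DELETE FROM ingredients
               WHERE pkey NOT IN (SELECT pkey FROM ingredients_delete_keys)""")
print(cur.execute("SELECT fk, recipe, pkey FROM ingredients ORDER BY pkey").fetchall())
# -> [(1, 'soup', 1), (2, 'stew', 4)]  (the first pkey per group survives)
```

Swapping MIN for MAX in the keys query would keep pkey 3 and 5 instead, matching the caveat above.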
The low selectivity issue mentioned by Remus is not sufficient on its own to cause the problem on a table of that size. The uniqueifier starts at 1 and can go up to 2,147,483,646 before actually overflowing the range. It also requires the right pattern of repeated deletes and inserts to see the issue.
CREATE TABLE T
(
X SMALLINT,
Y INT IDENTITY PRIMARY KEY NONCLUSTERED
)
CREATE CLUSTERED INDEX IX ON T(X)
INSERT INTO T VALUES (1),(1),(1),(2),(2)
Gives
+---+---+-------------+
| X | Y | Uniqueifier |
+---+---+-------------+
| 1 | 1 | |
| 1 | 2 | 1 |
| 1 | 3 | 2 |
| 2 | 4 | |
| 2 | 5 | 1 |
+---+---+-------------+
Then running
DELETE FROM T
WHERE Y IN (2,3)
INSERT INTO T VALUES (1),(1)
Gives
+---+---+-------------+
| X | Y | Uniqueifier |
+---+---+-------------+
| 1 | 1 | |
| 1 | 6 | 3 |
| 1 | 7 | 4 |
| 2 | 4 | |
| 2 | 5 | 1 |
+---+---+-------------+
Showing in that case the uniqueifier did not reuse the values from the deleted rows.
However then running
DELETE FROM T
WHERE Y IN (6,7)
WAITFOR DELAY '00:00:10'
INSERT INTO T VALUES (1),(1)
Gives
+---+---+-------------+
| X | Y | Uniqueifier |
+---+---+-------------+
| 1 | 1 | |
| 1 | 8 | 1 |
| 1 | 9 | 2 |
| 2 | 4 | |
| 2 | 5 | 1 |
+---+---+-------------+
Showing that the high water mark can be reset after deleting the duplicate with the highest uniqueifier value. The delay was to allow the ghost record cleanup process to run.
Because life is too short to insert 2 billion duplicates, I then used DBCC WRITEPAGE to manually adjust the highest uniqueifier to 2,147,483,644
![enter image description here](https://i.stack.imgur.com/SDza7.png)
I then ran
INSERT INTO T VALUES (1)
multiple times. It succeeded twice and failed on the third attempt with error 666.
This was actually one lower than I would have assumed, meaning that the highest uniqueifier inserted was 2,147,483,646 rather than the maximum int value of 2,147,483,647.
Best Answer
You can eliminate the duplicates from your query by using the ROW_NUMBER() window function. The first ROW_NUMBER() call, rn1, is used to select rows where there are multiple ID02 values for each individual ID01 value. The second ROW_NUMBER() call, rn2, is used to preserve the case where ID01 and ID02 have multiple duplicate values, "which our system already corrects for automatically". That pattern can be leveraged to remove the invalid rows from the source table, by using the DELETE FROM <cte> syntax. The output shows first the SELECT, then the table after the problematic rows have been removed.
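The queries this answer refers to are not reproduced above, so here is a hedged sketch of one plausible reading of the pattern, using invented sample data and the ID01/ID02 names from the prose. It runs in SQLite via Python's sqlite3; since SQLite has no DELETE FROM <cte>, the numbered rows feed a rowid IN (...) subquery instead:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Example (ID01 INT, ID02 TEXT)")
# (1,'B') is an invalid extra ID02 for ID01=1; (2,'C') is an exact duplicate pair
cur.executemany("INSERT INTO Example VALUES (?, ?)",
                [(1, "A"), (1, "B"), (2, "C"), (2, "C"), (3, "D")])
# rn1 numbers rows within each ID01 (rn1 > 1 flags additional ID02 values);
# rn2 numbers exact (ID01, ID02) duplicates, so repeated pairs are preserved
numbered = """
    SELECT rowid AS rid,
           ROW_NUMBER() OVER (PARTITION BY ID01 ORDER BY ID02, rowid) AS rn1,
           ROW_NUMBER() OVER (PARTITION BY ID01, ID02 ORDER BY rowid) AS rn2
    FROM Example
"""
# Delete rows that introduce a second distinct ID02 for an ID01 (rn1 > 1)
# but are not mere exact duplicates of a kept pair (rn2 = 1)
cur.execute(f"DELETE FROM Example WHERE rowid IN "
            f"(SELECT rid FROM ({numbered}) WHERE rn1 > 1 AND rn2 = 1)")
print(cur.execute("SELECT ID01, ID02 FROM Example ORDER BY ID01, ID02").fetchall())
# -> [(1, 'A'), (2, 'C'), (2, 'C'), (3, 'D')]  (only the invalid (1,'B') row is gone)
```

The rn1 > 1 AND rn2 = 1 condition is an assumption reconstructed from the description above; in SQL Server the same subquery could be written as a CTE and deleted from directly.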