MySQL Query – Remove Duplicate Records with Special Rules

Tags: duplication, MySQL, query

I have the following data in a table:

id    loc1    loc2    zip_code
------------------------------
10    1       null    12345
10    null    1       null

11    1       null    null
11    null    1       43210

12    1       null    54321
12    null    1       87654

13    1       null    null
13    null    1       null

14    1       null    65432
15    1       null    null
16    null    1       76767
17    null    1       null
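
To reproduce the example, the sample data can be loaded with something like the following. This is only a sketch: the table name test is taken from the answer below, while the column types are assumptions, since they are not stated here.

```sql
-- Assumed schema: column types are guesses, not given in the question.
-- id is deliberately not a primary key, because it repeats.
CREATE TABLE test (
    id       INT NOT NULL,
    loc1     INT NULL,
    loc2     INT NULL,
    zip_code VARCHAR(10) NULL   -- assumption: zip codes stored as text
);

INSERT INTO test (id, loc1, loc2, zip_code) VALUES
    (10, 1,    NULL, '12345'),
    (10, NULL, 1,    NULL),
    (11, 1,    NULL, NULL),
    (11, NULL, 1,    '43210'),
    (12, 1,    NULL, '54321'),
    (12, NULL, 1,    '87654'),
    (13, 1,    NULL, NULL),
    (13, NULL, 1,    NULL),
    (14, 1,    NULL, '65432'),
    (15, 1,    NULL, NULL),
    (16, NULL, 1,    '76767'),
    (17, NULL, 1,    NULL);
```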

The goal is to remove the duplicates with some special rules.

To accomplish this, I have the option to delete the "bad" records directly from my original table or copy the "good" records into a new table.

Rules:

  • Each record always has exactly one location set, i.e. each row has either loc1 or loc2.
  • The unique rows (id: 14-17) are good as they are.
  • For the duplicate rows (id: 10-13), I want to keep only one row per id:
    • If exactly one of the rows has a zip_code, keep that row.
    • In all other cases (both rows or neither row has a zip_code), keep the primary location, i.e. the row that has loc1.

After de-duplication, the final data should look like this:

id    loc1    loc2    zip_code
------------------------------
10    1       null    12345
11    null    1       43210
12    1       null    54321
13    1       null    null

14    1       null    65432
15    1       null    null
16    null    1       76767
17    null    1       null

Best Answer

WITH cte AS ( SELECT *,
                     ROW_NUMBER() OVER (PARTITION BY id ORDER BY zip_code IS NULL, loc1 IS NULL) rn
              FROM test )
SELECT id, loc1, loc2, zip_code
FROM cte
WHERE rn = 1
ORDER BY id;
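
Why the ordering works: in MySQL an expression such as zip_code IS NULL evaluates to 0 (false) or 1 (true), so sorting on it in ascending order puts rows that have a zip_code first, and loc1 IS NULL then breaks ties in favour of the primary location. Window functions and CTEs require MySQL 8.0 or later. As an illustrative sketch (the column aliases zip_missing and loc1_missing are made up for this example), the sort keys and the resulting row numbers can be inspected like this:

```sql
-- Show the two sort keys next to the row number they produce.
SELECT id, loc1, loc2, zip_code,
       zip_code IS NULL AS zip_missing,   -- 0 = has a zip_code, sorts first
       loc1 IS NULL     AS loc1_missing,  -- 0 = primary location, wins ties
       ROW_NUMBER() OVER (PARTITION BY id
                          ORDER BY zip_code IS NULL, loc1 IS NULL) AS rn
FROM test
ORDER BY id, rn;
```

For id 11, the loc2 row has zip_missing = 0 and therefore gets rn = 1. For id 12, both rows have zip_missing = 0, so the loc1 row wins on loc1_missing.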

If I want to use this to delete the duplicates, is it enough to change the condition to WHERE rn = 2? – advncd

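@advncd No. MySQL does not support updatable CTEs, so you cannot delete through the CTE directly. Instead, use the query as a subquery that selects the rows to be removed (rn > 1), and join it to the source table in a multi-table DELETE, matching on every column with the NULL-safe comparison operator <=>: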

DELETE t1.*
FROM test t1
JOIN (WITH cte AS ( SELECT *,
                           ROW_NUMBER() OVER (PARTITION BY id ORDER BY zip_code IS NULL, loc1 IS NULL) rn
                    FROM test )
      SELECT id, loc1, loc2, zip_code
      FROM cte
      WHERE rn > 1) t2 ON t1.id = t2.id
                      AND t1.loc1 <=> t2.loc1
                      AND t1.loc2 <=> t2.loc2
                      AND t1.zip_code <=> t2.zip_code;
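
The question also mentions copying the "good" records into a new table instead of deleting. A minimal sketch of that route, assuming MySQL 8.0 or later and a hypothetical target table name test_dedup, reuses the same ranking inside a derived table:

```sql
-- test_dedup is a hypothetical name for the de-duplicated copy.
CREATE TABLE test_dedup AS
SELECT id, loc1, loc2, zip_code
FROM ( SELECT t.*,
              ROW_NUMBER() OVER (PARTITION BY id
                                 ORDER BY zip_code IS NULL, loc1 IS NULL) AS rn
       FROM test t ) ranked
WHERE rn = 1;

-- Sanity check: this should return no rows if each id is now unique.
SELECT id, COUNT(*) AS cnt
FROM test_dedup
GROUP BY id
HAVING COUNT(*) > 1;
```

This route also avoids a caveat of the join-based DELETE above: if two rows were identical in every column, the join would match both of them and remove both, whereas filtering on rn = 1 always keeps exactly one row per id. Note that CREATE TABLE ... AS SELECT does not copy indexes from the source table.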
