Merge and Remove Duplicates in MySQL – How to Guide

MySQL

We have a few MySQL (InnoDB) tables with repeated data and I want to clean them up. Consider the following:

Table user_clubs

| id | user_id | name                   | function                 | description         | school_grade |
|----|---------|------------------------|--------------------------|---------------------|--------------|
| 1  | 10      | World Cultures Club    |                          | Cultural activities | Freshman     |
| 2  | 10      | World Cultures Club    |                          | Cultural activities | Sophomore    |
| 3  | 10      | World Cultures Club    |                          | Cultural activities | Senior       |
| 4  | 20      | National Honor Society |                          | Vounteer            | Second Year  |
| 5  | 30      | National Honor Society | Public Relations Officer | School Society      | 11th         |
| 6  | 30      | National Honor Society | Public Relations Officer | School Society      | 12th         |
| 7  | 40      | Student Council        | Sophomore Secretary      |                     | Sophomore    |
| 8  | 40      | Student Council        | Junior Secretary         |                     | Junior       |

We have duplicated rows for each school_grade. What I want to end up with is:

  • Remove the school_grade column, having all the school_grade values in a user_clubs_meta table
  • Remove unnecessary duplicates in the user_clubs table, keeping only the record which has a related user_clubs_meta record

Example:

  • record id 1,2 & 3 need to get reduced to a single row and a new user_clubs_meta row with meta_key=school_grades and meta_value=["Freshman","Sophomore","Senior"] needs to get added (id column omitted for simplicity)
  • id 4 can be left as-is
  • id's 5 & 6 needs to get consolidated & meta added
  • id's 7 & 8 can be left as-is

Where I need help:

  • Populate the user_clubs_meta table. I have the following query:

    INSERT INTO user_clubs_meta (user_clubs_id, meta_key, meta_value) 
    SELECT id, 'school_grades', CONCAT('[', GROUP_CONCAT(CONCAT('"', school_grade, '"')), ']')
    FROM user_clubs 
    GROUP BY user_id, name, function, description 
    HAVING COUNT(*) > 1;
    

    Data looks good to me, but I'm not 100% positive if this gives me the correct result. If someone could confirm.

  • Remove the duplicate records from user_clubs. This is where I'm completely stuck. I guess basically do the opposite/diff of

    SELECT id
    FROM user_clubs 
    GROUP BY user_id, name, function, description
    HAVING COUNT(*) > 1;
    

    where the query would give me only a list of id's which I can remove from user_clubs unless the record has a related record in user_clubs_meta.

Any suggestion is highly appreciated.

Best Answer

The query for populating the users_club_meta table looks good to me except for two issues:

  1. I am not entirely convinced about HAVING COUNT(*) > 1. If you want to create entries for all combinations of (user_id, name, function, description), remove that condition.

    If you want to omit the groups consisting only of empty school_grade values, you can use this condition instead:

    HAVING COUNT(school_grade) > 0
    

    That is assuming that by "empty" you mean NULL. If you mean to exclude empty strings ('') as well, you could modify the above condition like this:

    HAVING COUNT(NULLIF(school_grade, '')) > 0
    
  2. You are selecting id in that query but the column is not in GROUP BY. It is true that MySQL allows you to do that but the manual also discourages that you do that in cases where the non-aggregated non-GROUP BY column has multiple values per group.

    If you want predictable results, use an aggregate function, for instance MIN():

    INSERT INTO
      user_clubs_meta (user_clubs_id, meta_key, meta_value) 
    SELECT
      MIN(id),
      'school_grades',
      CONCAT('[', GROUP_CONCAT(CONCAT('"', school_grade, '"')), ']')
    FROM
      user_clubs 
    GROUP BY
      user_id, name, function, description 
    HAVING
      COUNT(school_grade) > 0
    ;

For the DELETE query, take the final draft of the INSERT statement's SELECT query keeping only MIN(id) in the SELECT clause and removing the HAVING clause:

SELECT
  MIN(id)
FROM
  user_clubs 
GROUP BY
  user_id, name, function, description
;

That gives you the first ID of every unique combination of (user_id, name, function, description). That will be the rows you want to keep. All the others you want to remove. This is an anti-join, and there are various ways to implement it. In this case you can go with:

  • NOT IN:

    DELETE FROM
      user_clubs
    WHERE
      id NOT IN
      (
        SELECT MIN(id)
        FROM user_clubs 
        GROUP BY user_id, name, function, description
      )
    ;
    
  • LEFT JOIN + WHERE IS NULL:

    DELETE
      uc
    FROM
      user_clubs AS uc
      LEFT JOIN
      (
        SELECT MIN(id) AS min_id
        FROM user_clubs 
        GROUP BY user_id, name, function, description
      ) AS filter
      ON uc.id = filter.min_id
    WHERE
      filter.min_id IS NULL
    ;
    

The left join method is known to work fast in MySQL but you should test for yourself in your environment to choose what works better for you.