MySQL 5.7 – Is It Possible to Clean Up Duplicate Entries with Just SQL?

MySQLmysql-5.7

I accidentally introduced a bug in my software that caused it to create duplicate entries. Now I have a lot of duplicate entries, the total rows of that table is around 111 million.

This is my table structure for that table.

CREATE TABLE `graph_player_ranked` (
  `player_id` int(11) NOT NULL,
  `season_id` int(11) NOT NULL,
  `playlist_id` int(11) NOT NULL,
  `timestamp` datetime NOT NULL,
  `matches_played` int(11) DEFAULT NULL,
  `rank_points` int(11) DEFAULT NULL,
  `tier` int(11) DEFAULT NULL,
  `division` int(11) DEFAULT NULL,
  `no_update_at` datetime NOT NULL,
  PRIMARY KEY (`player_id`,`season_id`,`playlist_id`,`timestamp`),
  KEY `player_id` (`player_id`,`playlist_id`,`timestamp`),
  KEY `idx_timestamp` (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Please don't comment about the possibly bad structure, I am switching to a new structure but I have to fix this issue first. However, should I remove timestamp from the PK?

The idea was to only add a new row when "rank_points" OR "tier" OR "division" changed OR "matches_played" increased. At some time I accidentally made it create a new row when "matches_played" was the same as the previous row or larger.

Example of duplicate data:

Data sample

I want to remove all duplicate entries with just one row. So for the above image that would be:

  1. Remove all of the rows in a red rectangle
  2. Grab "timestamp" from the first row in the rectangle.
  3. Grab "no_update_at" timestamp from the last row in the rectangle.
  4. Place one row back with all previous data and the above two timestamps.

I have no idea how to achieve this, the only solution I can think of is creating an external tool to do this. But that would probably take very long to process all rows.

Edit: Sample data for two players can be found here https://dl.dropboxusercontent.com/u/184110260/sample_data/graph_player_ranked.sql

Best Answer

Your data sample seems to suggest that the two datetime columns are the only columns that can change in each group of duplicates. In other words, duplicates are determined by columns player_id, season_id, playlist_id, matches_played, rank_points, tier, division.

One other assumption that your data sample prompts is that a row with a greater timestamp corresponds to a row with a greater no_update_at as well. (In fact, the two columns always have the same value in your example.)

Finally, for the purpose of this answer, I am also going to assume that when you are saying "first" or "last" (row in the rectangle), you are talking about the earliest or latest datetime value.

With those out of the way, you can generate the desired result set from your table if you just group the rows by the columns listed above and take MIN(timestamp) for timestamp and MAX(no_update_at) for no_update_at – that is, like this:

SELECT
  player_id,
  season_id,
  playlist_id,
  MIN(timestamp) AS timestamp,
  matches_played,
  rank_points,
  tier,
  division,
  MAX(no_update_at) AS no_update_at
FROM
  graph_player_ranked
GROUP BY
  player_id,
  season_id,
  playlist_id,
  matches_played,
  rank_points,
  tier,
  division
;

You could store that in a temporary table:

CREATE TEMPORARY TABLE tmp_graph_player_ranked
AS
SELECT
  ...
;

then truncate the source:

TRUNCATE graph_player_ranked;

and populate it from the temporary table:

INSERT INTO
  graph_player_ranked
    (
      player_id,
      season_id,
      playlist_id,
      timestamp,
      matches_played,
      rank_points,
      tier,
      division,
      no_update_at
    )
SELECT
  player_id,
  season_id,
  playlist_id,
  timestamp,
  matches_played,
  rank_points,
  tier,
  division,
  no_update_at
FROM
  tmp_graph_player_ranked
;

Of course, if the table is referenced by other tables, you cannot simply remove all data – you will probably need to disable or remove the foreign keys before re-populating the table and enable or recreate them afterwards. (That is an entirely different problem that would need to be addressed separately, in case you need help with it.)