I accidentally introduced a bug in my software that caused it to create duplicate entries. Now I have a lot of duplicate entries, the total rows of that table is around 111 million.
This is my table structure for that table.
CREATE TABLE `graph_player_ranked` (
`player_id` int(11) NOT NULL,
`season_id` int(11) NOT NULL,
`playlist_id` int(11) NOT NULL,
`timestamp` datetime NOT NULL,
`matches_played` int(11) DEFAULT NULL,
`rank_points` int(11) DEFAULT NULL,
`tier` int(11) DEFAULT NULL,
`division` int(11) DEFAULT NULL,
`no_update_at` datetime NOT NULL,
PRIMARY KEY (`player_id`,`season_id`,`playlist_id`,`timestamp`),
KEY `player_id` (`player_id`,`playlist_id`,`timestamp`),
KEY `idx_timestamp` (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Please don't comment about the possibly bad structure, I am switching to a new structure but I have to fix this issue first. However, should I remove timestamp
from the PK?
The idea was to only add a new row when "rank_points" OR "tier" OR "division" changed OR "matches_played" increased. At some time I accidentally made it create a new row when "matches_played" was the same as the previous row or larger.
Example of duplicate data:
I want to remove all duplicate entries with just one row. So for the above image that would be:
- Remove all of the rows in a red rectangle
- Grab "timestamp" from the first row in the rectangle.
- Grab "no_update_at" timestamp from the last row in the rectangle.
- Place one row back with all previous data and the above two timestamps.
I have no idea how to achieve this, the only solution I can think of is creating an external tool to do this. But that would probably take very long to process all rows.
Edit: Sample data for two players can be found here https://dl.dropboxusercontent.com/u/184110260/sample_data/graph_player_ranked.sql
Best Answer
Your data sample seems to suggest that the two
datetime
columns are the only columns that can change in each group of duplicates. In other words, duplicates are determined by columnsplayer_id, season_id, playlist_id, matches_played, rank_points, tier, division
.One other assumption that your data sample prompts is that a row with a greater
timestamp
corresponds to a row with a greaterno_update_at
as well. (In fact, the two columns always have the same value in your example.)Finally, for the purpose of this answer, I am also going to assume that when you are saying "first" or "last" (row in the rectangle), you are talking about the earliest or latest datetime value.
With those out of the way, you can generate the desired result set from your table if you just group the rows by the columns listed above and take
MIN(timestamp)
fortimestamp
andMAX(no_update_at)
forno_update_at
– that is, like this:You could store that in a temporary table:
then truncate the source:
and populate it from the temporary table:
Of course, if the table is referenced by other tables, you cannot simply remove all data – you will probably need to disable or remove the foreign keys before re-populating the table and enable or recreate them afterwards. (That is an entirely different problem that would need to be addressed separately, in case you need help with it.)