MySQL Deduplicate Data Using HAVING with Subquery

duplicationfunctionsMySQLsubquery

I've inherited a DB that was lacking a unique constraint so data was/is duplicated. I'm now trying to remove the duplicate records, then add a constraint to block this going forward.

I have this query:

SELECT count(*) as dacount, substr(group_concat(id), (locate(',', group_concat(id))+ 1))
FROM `game`
group by matchid, ordinal
having dacount > 1
order by dacount desc

which correctly gives me the ids of the rows I need to delete. The problem with this though is that I can't use this as a subquery on the DELETE because of the dacount with the having parameter. Is there another way to do this?

Here was my plan:

DELETE FROM game WHERE id IN (SELECT count(*) as dacount, substr(group_concat(id), (locate(',', group_concat(id))+ 1))
FROM `game`
GROUP BY matchid, ordinal
HAVING dacount > 1)

Best Answer

There are restrictions regarding self-joins in MySQL's implementation of UPDATE and DELETE. The issue is usually solvable by using joins:

DELETE g.*
FROM game AS g
  JOIN 
   ( SELECT matchid, ordinal, min(id) AS id 
     FROM game
     GROUP BY matchid, ordinal
     -- HAVING count(*) > 1
   ) AS d
   ON  g.matchid = d.matchid
   AND g.ordinal = d.ordinal
   AND g.id > d.id ;

Test at dbfiddle.uk.


Another way would be using a LEFT JOIN / IS NULL construction:

DELETE g.*
FROM game AS g
  LEFT JOIN 
   ( SELECT min(id) AS id 
     FROM game
     GROUP BY matchid, ordinal
   ) AS d
   ON  g.id = d.id 
WHERE d.id IS NULL ;

or an EXISTS subquery rewritten as a JOIN
(delete all games that there exists another game with same matchid, ordinal and smaller id):

DELETE g.*
FROM game AS g
  JOIN 
     game AS d
   ON  g.matchid = d.matchid
   AND g.ordinal = d.ordinal
   AND g.id > d.id
 ;