I received a database with a few million records in it, but apperently there might be duplicate records in them.
A user enters data into the database and a primary key is generated, however if the user enters the same data again, a new primary key will be generated for that data, even though the data has already been entered before. There are no checks on this.
I need to go looking for these duplicates, but I do not really know where to start. I first thought concatenating all cells except the primary key in a subquery and then count these rows and see which ones have a count higher than 1.
cfr.
pkey recipe fkey comment
1 toast 3 tasty
2 curry 2 spicy
3 curry 2 spicy
4 bread 1 crumbly
5 orios 2 cookies
Here the curry entries are identical and I'd have to delete 1 of those.
However I read concatenating is unpredictable in mysql and it just feels a bit wrong to me as well.
Any hints ?
Best Answer
Suppose your table is called
ingredients
. Try the following:Step 01) Create an empty delete keys table called
ingredients_delete_keys
Step 02) Create PRIMARY KEY on
ingredients_delete_keys
Step 03) Index the
ingredients
table with fk,recipe,pkeyStep 04) Populate the
ingredients_delete_keys
tableStep 05) Perform a DELETE JOIN on ingredients table using keys that don't match
Step 06) Drop the delete keys
Step 07) Get rid of the
fk_recipe_pkey_ndx
indexOK Here are all the lines in one block...
Give it a Try !!!
CAVEAT
Notice that using MIN function helps keep the first pkey entered for fk. If you switch it to MAX function instead, the last pkey entered for fk is kept.