MySQL – Need to find duplicate entries

duplication, MySQL

I received a database with a few million records in it, but apparently there might be duplicate records among them.

A user enters data into the database and a primary key is generated. However, if the user enters the same data again, a new primary key is generated for it, even though the data has already been entered before. There are no checks on this.

I need to go looking for these duplicates, but I do not really know where to start. My first thought was to concatenate all columns except the primary key in a subquery, then count the resulting rows and see which values have a count higher than 1.

For example:

pkey  recipe  fkey  comment
1     toast   3     tasty
2     curry   2     spicy
3     curry   2     spicy
4     bread   1     crumbly
5     orios   2     cookies

Here the two curry entries are identical, and I would have to delete one of them.

However, I have read that concatenation is unpredictable in MySQL, and it feels a bit wrong to me as well.

Any hints?

Best Answer

Suppose your table is called ingredients and its foreign key column is named fk (fkey in your sample). Try the following:

Step 01) Create an empty delete keys table called ingredients_delete_keys

CREATE TABLE ingredients_delete_keys
SELECT fk,recipe,pkey FROM ingredients WHERE 1=2;

Step 02) Create PRIMARY KEY on ingredients_delete_keys

ALTER TABLE ingredients_delete_keys ADD PRIMARY KEY (fk,recipe,pkey);

Step 03) Index the ingredients table with fk,recipe,pkey

ALTER TABLE ingredients ADD INDEX fk_recipe_pkey_ndx (fk,recipe,pkey);

Step 04) Populate the ingredients_delete_keys table

INSERT INTO ingredients_delete_keys
SELECT fk,recipe,MIN(pkey)
FROM ingredients GROUP BY fk,recipe;

Step 05) Perform a DELETE JOIN on ingredients table using keys that don't match

-- Delete every row of ingredients whose (fk,recipe,pkey) is not a kept key
DELETE B.*
FROM ingredients B
LEFT JOIN ingredients_delete_keys A
USING (fk,recipe,pkey)
WHERE A.pkey IS NULL;
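On a table with millions of rows, it may be worth previewing the affected rows before running the delete in Step 05. The query below is only a sketch and not part of the original answer. It assumes ingredients_delete_keys has already been populated in Step 04, and it lists every row of ingredients whose (fk,recipe,pkey) combination is not among the kept keys.

```sql
-- Preview only, nothing is modified.
-- Rows returned here are the extra copies in each (fk,recipe) group,
-- i.e. the rows that are NOT listed in ingredients_delete_keys.
SELECT B.*
FROM ingredients B
LEFT JOIN ingredients_delete_keys A
USING (fk,recipe,pkey)
WHERE A.pkey IS NULL;
```

With the sample data from the question, this should return only the second curry row (pkey 3). Rows with NULL in fk or recipe would need separate care, since NULL never compares equal in a join.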

Step 06) Drop the delete keys table

DROP TABLE ingredients_delete_keys;

Step 07) Get rid of the fk_recipe_pkey_ndx index

ALTER TABLE ingredients DROP INDEX fk_recipe_pkey_ndx;

Here are all the statements in one block:

CREATE TABLE ingredients_delete_keys
SELECT fk,recipe,pkey FROM ingredients WHERE 1=2;
ALTER TABLE ingredients_delete_keys ADD PRIMARY KEY (fk,recipe,pkey);
ALTER TABLE ingredients ADD INDEX fk_recipe_pkey_ndx (fk,recipe,pkey);
INSERT INTO ingredients_delete_keys
SELECT fk,recipe,MIN(pkey)
FROM ingredients GROUP BY fk,recipe;
DELETE B.*
FROM ingredients B
LEFT JOIN ingredients_delete_keys A
USING (fk,recipe,pkey)
WHERE A.pkey IS NULL;
DROP TABLE ingredients_delete_keys;
ALTER TABLE ingredients DROP INDEX fk_recipe_pkey_ndx;

Give it a try!

CAVEAT

Notice that using the MIN function keeps the first pkey entered for each (fk,recipe) pair. If you switch to the MAX function instead, the last pkey entered for each pair is kept.
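Since the question is first about finding the duplicates, a plain GROUP BY with HAVING may be enough to list them before anything is deleted. This is a sketch rather than part of the original answer. It assumes the table is named ingredients and the foreign key column is fk (fkey in the question's sample), and it compares every non-key column, including comment, which the steps above do not look at.

```sql
-- List each group of rows that share the same non-key values
-- and occur more than once, with the lowest and highest pkey in the group.
SELECT recipe, fk, comment,
       COUNT(*)  AS copies,
       MIN(pkey) AS first_pkey,
       MAX(pkey) AS last_pkey
FROM ingredients
GROUP BY recipe, fk, comment
HAVING COUNT(*) > 1;
```

No concatenation is involved, so the concern about CONCAT in the question does not arise here.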

Related Solutions

MySQL – How to Preserve ID Generated from PRIMARY KEY When Moving Data

When a primary key is defined with AUTO_INCREMENT, MySQL generates a new ID when you insert NULL or omit the column. If you set ID=4 in your INSERT, the ID will be 4, so you will not lose your IDs during the "move" operation.
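A small illustration of that behavior, using a throwaway table whose name and columns (demo_items, id, name) are made up for this example:

```sql
-- Hypothetical demo table, not from the original question.
CREATE TABLE demo_items (
  id   INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(50)
);

INSERT INTO demo_items (id, name) VALUES (NULL, 'first');   -- id is generated
INSERT INTO demo_items (id, name) VALUES (4, 'moved row');  -- id is stored as 4
INSERT INTO demo_items (name) VALUES ('next');              -- id is generated again

SELECT id, name FROM demo_items ORDER BY id;
```

The second row keeps the ID that was supplied explicitly, which is what makes it possible to copy rows between tables without renumbering them.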

MySQL does not have the "SEQUENCE" notion that Oracle has, so your "global ID" requirement is not so easy to meet.

Maybe you can try something like this (but it adds complications for a table of just 4 million rows).

Create a table used to generate your "global ID", with a single auto-incremented INT column:

CREATE TABLE test.sequence_table (next_id int primary key auto_increment);

When you want to insert a new row into your child table:

Solution 1: With SELECT in information_schema

BEGIN; -- Start a new Transaction to ensure consistency

INSERT INTO test.sequence_table values (NULL); -- Generate a new ID

SELECT @next_ID:=(auto_increment - 1) FROM information_schema.tables WHERE table_schema="test" AND table_name="sequence_table"; -- Here I use a MySQL Variable but you can store it in PHP or whatever

INSERT INTO child_table values (null, @next_ID, "Max", "SQL"); -- Use your variable

COMMIT; -- Wonderful :)

Edit after ypercube's comment:

Solution 2: With LAST_INSERT_ID()

BEGIN; -- Start a new Transaction to ensure consistency

INSERT INTO test.sequence_table values (NULL); -- Generate a new ID

SELECT @next_ID:=LAST_INSERT_ID(); -- Use of the MySQL function LAST_INSERT_ID()

INSERT INTO child_table values (null, @next_ID, "Max", "SQL"); -- Use your variable

COMMIT; -- Wonderful :)

MySQL – Should a Multi-Column UNIQUE Index Be Created?

The good thing about unique indexes is that the search stops as soon as the first value matches, but that requires the WHERE clause to match the index exactly. In your case the index will be big. If you are lucky, the value will be found quickly in the B-tree; otherwise it might need to scan almost the entire index.
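To make the point about the WHERE clause concrete, here is a minimal sketch. The table and column names (demo_readings, sensor_id, taken_at, value) are invented for illustration and do not come from the original question. EXPLAIN can be used to compare how MySQL plans the two queries.

```sql
-- Hypothetical table with a two-column UNIQUE index.
CREATE TABLE demo_readings (
  id        INT PRIMARY KEY AUTO_INCREMENT,
  sensor_id INT NOT NULL,
  taken_at  DATETIME NOT NULL,
  value     DECIMAL(10,2),
  UNIQUE KEY uq_sensor_time (sensor_id, taken_at)
);

-- Equality on every column of the unique index: at most one row can match.
EXPLAIN SELECT value FROM demo_readings
WHERE sensor_id = 7 AND taken_at = '2020-01-01 00:00:00';

-- Only the second index column is constrained, so the leading column
-- sensor_id is missing and uq_sensor_time cannot serve as a direct lookup.
EXPLAIN SELECT value FROM demo_readings
WHERE taken_at = '2020-01-01 00:00:00';
```

The column order in the index matters: a query that filters only on sensor_id could still use the leftmost part of uq_sensor_time, while one that filters only on taken_at could not.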
