Mysql – How to delete repeated rows

deleteMySQL

I have a system that gathers RSS from many sites including UTF-8 based sites (non-english sites). I have tried to avoid repeated rows but I have repeats in my table.

RSS_items (id(auto_increment) PRIMARY KEY, url ,title, body)

Is there any way to delete the repeat? The problem is that sometimes this repeat happens and sometimes not. The first thing I think I must find those who are repeated, then delete the extra rows. But how ?

Best Answer

This would depend on which one of the repeat you want to keep. For this example, I will dedup the table based on url and title and keeping the first occurrence.

First thing to do: run these queries

SELECT COUNT(1) FROM RSS_items;

SELECT COUNT(1) dupcount_two,url,title FROM RSS_items
GROUP BY url,title HAVING COUNT(1) = 2 WITH ROLLUP;

SELECT COUNT(1) dupcount_morethantwo,url,title FROM RSS_items
GROUP BY url,title HAVING COUNT(1) > 2 WITH ROLLUP;

This will show total table count and how all (url,title) combinations that repeat twice and more than twice with the total for each at the bottom. If the sum of the repeats is more than 5% of the total:

CREATE TABLE RSS_items_URLTitle LIKE RSS_items;
INSERT INTO RSS_items_URLTitle (id,url,title)
    SELECT MIN(id),url,title
    FROM RSS_items GROUP BY url,title
    HAVING 
;
DELETE B.*
    FROM RSS_items_URLTitle A
    INNER JOIN RSS_items B
    USING (url,title) WHERE A.id <> B.id
;
DROP TABLE RSS_items_URLTitle;

This method made be very slow if you do not have an index on (url,title)

Otherwise, run this

CREATE TABLE RSS_items_New LIKE RSS_items;
CREATE TABLE RSS_items_URLTitle LIKE RSS_items;
INSERT INTO RSS_items_URLTitle (id,url,title)
    SELECT MIN(id),url,title
    FROM RSS_items GROUP BY url,title
;
INSERT INTO RSS_items_New
    SELECT B.*
    FROM RSS_items_URLTitle A
    INNER JOIN RSS_items USING (id)
;
ALTER TABLE RSS_items RENAME RSS_items_Zap;
ALTER TABLE RSS_items_New RENAME RSS_items;
DROP TABLE RSS_items_URLTitle;
DROP TABLE RSS_items_Zap;

This method may be better since it only operates on the table id.

Please look over both methods. Try testing it on test databases with copies of the data before running anything.

If you want to keep the last occurrence of duplicates, replace MIN(id) with MAX(id).

Give it a Try !!!

APPROACH #1 : DELETE JOIN

If @DeleteCount < 5% of @CompleteCount, use a list of IDs to DELETE from mytable:

CREATE TABLE mytable_delete_id
SELECT id FROM mytable WHERE LOCATE('.ac.',email) > 0;
ALTER TABLE mytable_delete_id ADD PRIMARY KEY (id);
DELETE B.* FROM mytable_delete_id A INNER JOIN mytable B USING (id);
DROP TABLE mytable_delete_id;

This helps take the guess work out of which rows to DELETE by using a SELECT to get the IDs. The JOIN takes care of the DELETE.

APPROACH #2 : CREATE A NEW TABLE

If you have the following conditions:

@DeleteCount > 30% of @CompleteCount
your table does not have foreign keys/constraints

you could just create a new table and filter out the rows with .ac. in the email column:

CREATE TABLE mytable_new LIKE mytable;
#insert rows which does not contain '.ac.' from mytable
INSERT INTO mytable_new SELECT * FROM mytable WHERE LOCATE('.ac.',email) = 0;
ALTER TABLE mytable RENAME mytable_old;
ALTER TABLE mytable_new RENAME mytable;
DROP TABLE mytable_old;

EPILOGUE

You can do either of these methods without checking table counta ahead of time. They will both work. If you are concerned that .ac. is in a huge number of rows, go with APPROACH #2. If there are not that many occurrences, go with APPROACH #1.

Best Answer

Related Solutions

Mysql – How do automaticly delete rows from other tables

Mysql – Fastest way to delete matching rows in MySQL

APPROACH #1 : DELETE JOIN

APPROACH #2 : CREATE A NEW TABLE

EPILOGUE

Related Question