MySQL – How to Select All Attributes of Duplicate Entries

I have a database that contains user data, and I need to find only the entries where two attributes (surname and social security number) are duplicated between users. I need all of the attributes for these entries, such as email and first name.

I first used group by to get every user where those two values are duplicates with this query:

SELECT * 
FROM jasenet 
GROUP BY hetu, sukunimi 
HAVING count(*) >= 2 
ORDER BY sukunimi;

This returns one row for each duplicated combination. After some research I tried the IN operator to list every individual duplicate person with this query:

SELECT * 
FROM jasenet 
WHERE sukunimi 
IN (SELECT sukunimi 
    FROM jasenet 
    GROUP BY hetu, sukunimi 
    HAVING count(*) >= 2 
    ORDER BY sukunimi)
ORDER BY sukunimi;

This works somewhat, because I get every individual with a duplicate surname, but I also need the query to take the social security number (hetu) into account.

I tried this query, but it didn't work:

SELECT * 
FROM jasenet 
WHERE sukunimi 
IN (SELECT sukunimi 
    FROM jasenet 
    GROUP BY hetu, sukunimi 
    HAVING count(*) >= 2 
    ORDER BY sukunimi)
AND hetu
IN (SELECT sukunimi 
    FROM jasenet 
    GROUP BY hetu, sukunimi 
    HAVING count(*) >= 2 
    ORDER BY sukunimi)
ORDER BY sukunimi;

How can I use two columns returned from a query in another query?

Best Answer

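There are a few ways of doing this, and the best one depends on the size of your initial table. The examples below use surname, sin and userid as column names; substitute your own (for example sukunimi and hetu).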

Option 1:

/** This gives you all the surnames that are duplicated **/
Create Temporary Table tmpData
(surname CHAR(100) NOT NULL,
counter INT NOT NULL DEFAULT 1,
Primary Key (surname));
Insert Into tmpData
Select surname, count(1) as recs
From (jasenet)
Group BY 1
Having recs > 1;

/** This gives you all the SINs that are duplicated **/
Create Temporary Table tmpData2
(sin CHAR(100) NOT NULL,
counter INT NOT NULL DEFAULT 1,
Primary Key (sin));
Insert Into tmpData2
Select sin, count(1) as recs
From (jasenet)
Group BY 1
Having recs > 1;

/** Every row whose surname or SIN is duplicated **/
Select A.*
From jasenet A
Where A.surname In (Select surname From tmpData)
Or A.sin In (Select sin From tmpData2);

Option 2 (Horribly slow -- never do)

Select A.userid, B.userid, 
    IF(A.surname = B.surname, 1, 0) as sameSurname,
    IF(A.sin = B.sin, 1, 0) as sameSIN
From (jasenet A, jasenet B)
Where A.userid != B.userid;

Option 3 -- Probably what you're looking for

Select A.surname, group_concat(A.userid) as users
From (jasenet A)
Group By 1
Having COUNT(A.surname) > 1;

Select A.sin, group_concat(A.userid) as users
From (jasenet A)
Group By 1
Having COUNT(A.sin) > 1;

So I gave a few different approaches, but there are a few things I should mention. Your queries use SELECT * with GROUP BY, so they select columns that are neither grouped nor wrapped in an aggregate function. In general that means non-deterministic results, because MySQL may take those values from any row in the group. It's also just not how GROUP BY is meant to be used. What I tell my developers is that when you use GROUP BY, it's as if you're building a new table with the grouped fields as the PRIMARY KEY of that table.
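
As a rough illustration of that "new table keyed by the grouped fields" idea, here is a minimal sketch that joins jasenet back to its own grouped result on both columns, which is one way to use two columns from a subquery. It runs on SQLite through Python's sqlite3 module purely so that it is self-contained; the userid and etunimi columns and the sample rows are made up for the example, and MySQL itself was not used to run it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jasenet (
        userid   INTEGER PRIMARY KEY,  -- hypothetical key column
        etunimi  TEXT,                 -- hypothetical first-name column
        sukunimi TEXT,                 -- surname (from the question)
        hetu     TEXT                  -- social security number (from the question)
    );
    -- Made-up sample rows: only the first two share BOTH surname and hetu.
    INSERT INTO jasenet (etunimi, sukunimi, hetu) VALUES
        ('Anna',  'Virtanen', 'AAA'),
        ('Anne',  'Virtanen', 'AAA'),
        ('Pekka', 'Virtanen', 'BBB'),
        ('Liisa', 'Korhonen', 'CCC');
""")

# The derived table d holds one row per duplicated (hetu, sukunimi) pair,
# so the pair acts like its primary key. Joining on both columns returns
# every full row that belongs to such a pair.
duplicates = conn.execute("""
    SELECT j.*
    FROM jasenet AS j
    JOIN (
        SELECT hetu, sukunimi
        FROM jasenet
        GROUP BY hetu, sukunimi
        HAVING COUNT(*) >= 2
    ) AS d
      ON d.hetu = j.hetu AND d.sukunimi = j.sukunimi
    ORDER BY j.sukunimi, j.userid
""").fetchall()

for row in duplicates:
    print(row)
```

Only the two rows that share both values come back; the third Virtanen has a different hetu and is left out. In MySQL the same pairing can also be written as a row comparison, WHERE (hetu, sukunimi) IN (SELECT hetu, sukunimi ...).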

Do not use #2, as it basically has to do one full scan for every userid. I just wanted to give it to you so you can see that you can join a table to itself (as long as it's not a temporary table).

Option 1 is similar to #3 but uses temp tables, which are extremely fast, and I figured you could then use the temp tables to join against another table. Just remember, temporary tables are gone once you close the session.

Let me know if this is not what you were looking for.

Related Solutions

Mysql – Need to find duplicate entries

Suppose your table is called ingredients. Try the following:

Step 01) Create an empty delete keys table called ingredients_delete_keys

CREATE TABLE ingredients_delete_keys
SELECT fk,recipe,pkey FROM ingredients WHERE 1=2;

Step 02) Create PRIMARY KEY on ingredients_delete_keys

ALTER TABLE ingredients_delete_keys ADD PRIMARY KEY (fk,recipe,pkey);

Step 03) Index the ingredients table with fk,recipe,pkey

ALTER TABLE ingredients ADD INDEX fk_recipe_pkey_ndx (fk,recipe,pkey);

Step 04) Populate the ingredients_delete_keys table

INSERT INTO ingredients_delete_keys
SELECT fk,recipe,MIN(pkey)
FROM ingredients GROUP BY fk,recipe;

Step 05) Perform a DELETE JOIN on ingredients table using keys that don't match

DELETE A.*
FROM ingredients A
LEFT JOIN ingredients_delete_keys B
USING (fk,recipe,pkey)
WHERE B.pkey IS NULL;

Step 06) Drop the delete keys

DROP TABLE ingredients_delete_keys;

Step 07) Get rid of the fk_recipe_pkey_ndx index

ALTER TABLE ingredients DROP INDEX fk_recipe_pkey_ndx;

OK Here are all the lines in one block...

CREATE TABLE ingredients_delete_keys
SELECT fk,recipe,pkey FROM ingredients WHERE 1=2;
ALTER TABLE ingredients_delete_keys ADD PRIMARY KEY (fk,recipe,pkey);
ALTER TABLE ingredients ADD INDEX fk_recipe_pkey_ndx (fk,recipe,pkey);
INSERT INTO ingredients_delete_keys
SELECT fk,recipe,MIN(pkey)
FROM ingredients GROUP BY fk,recipe;
DELETE A.*
FROM ingredients A
LEFT JOIN ingredients_delete_keys B
USING (fk,recipe,pkey)
WHERE B.pkey IS NULL;
DROP TABLE ingredients_delete_keys;
ALTER TABLE ingredients DROP INDEX fk_recipe_pkey_ndx;

Give it a Try !!!

CAVEAT

Notice that using the MIN function keeps the first pkey entered for each (fk, recipe) combination. If you switch it to the MAX function instead, the last pkey entered for each combination is kept.
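
To make the MIN versus MAX choice concrete, here is a small sketch under stated assumptions. It uses SQLite through Python's sqlite3 module and a single DELETE with a NOT IN subquery, which is a different technique from the key table and DELETE JOIN in the steps above (MySQL typically rejects a subquery that reads the table being deleted from, so the key table is the safer route there). The sample rows and the surviving_pkeys helper are invented for the illustration.

```python
import sqlite3

# Made-up rows as (fk, recipe, pkey); pkey is assumed to be unique.
ROWS = [
    (1, "flour", 10),
    (1, "flour", 11),
    (1, "sugar", 12),
    (2, "flour", 13),
    (2, "flour", 14),
]

def surviving_pkeys(aggregate):
    """Deduplicate on (fk, recipe) and return the pkeys that remain."""
    if aggregate not in ("MIN", "MAX"):
        raise ValueError("aggregate must be 'MIN' or 'MAX'")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ingredients (fk INTEGER, recipe TEXT, pkey INTEGER)")
    conn.executemany(
        "INSERT INTO ingredients (fk, recipe, pkey) VALUES (?, ?, ?)", ROWS
    )
    # Keep exactly one pkey per (fk, recipe) group and delete the rest.
    conn.execute(f"""
        DELETE FROM ingredients
        WHERE pkey NOT IN (
            SELECT {aggregate}(pkey) FROM ingredients GROUP BY fk, recipe
        )
    """)
    kept = [row[0] for row in conn.execute("SELECT pkey FROM ingredients ORDER BY pkey")]
    conn.close()
    return kept

print(surviving_pkeys("MIN"))  # [10, 12, 13]
print(surviving_pkeys("MAX"))  # [11, 12, 14]
```

The group that has only one row (pkey 12) survives either way; only the groups with duplicates change depending on the aggregate.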

Mysql – remove duplicate rows in a MySQL table that does not contain a primary key

In the spirit of @yercube's answer, I have an answer that has an added twist.

CREATE TABLE stage
(
    id int not null auto_increment,
    name varchar(20),
    primary key (id)
);
CREATE TABLE stage2 LIKE stage;
INSERT INTO stage (name) SELECT name FROM item;
INSERT INTO stage2 (id) SELECT min_id FROM
(SELECT MIN(id) min_id,name FROM stage GROUP BY name) A;
UPDATE stage2 A INNER JOIN stage B USING (id) SET A.name=B.name;
TRUNCATE TABLE item;
INSERT INTO item (name) SELECT name FROM stage2;
DROP TABLE stage;
DROP TABLE stage2;

This will load stage2 with the first occurrence of each name from item, zap the item table, and load the unique occurrences back.
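
If it helps to see the flow end to end, here is a rough sketch of the same staging steps. It is run on SQLite through Python's sqlite3 module so that it is self-contained, which forces a few substitutions that are assumptions of this sketch and not part of the answer: DELETE FROM stands in for TRUNCATE TABLE, a correlated UPDATE stands in for the UPDATE ... INNER JOIN, stage2 is declared explicitly because SQLite has no CREATE TABLE ... LIKE, and the sample names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- A table with no primary key and some duplicate names (made-up data).
    CREATE TABLE item (name TEXT);
    INSERT INTO item (name) VALUES ('apple'), ('pear'), ('apple'), ('fig'), ('pear');

    -- stage gives every row a generated id; stage2 will hold the survivors.
    CREATE TABLE stage  (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
    CREATE TABLE stage2 (id INTEGER PRIMARY KEY, name TEXT);

    INSERT INTO stage (name) SELECT name FROM item ORDER BY rowid;

    -- Lowest id per name = first occurrence of that name.
    INSERT INTO stage2 (id) SELECT MIN(id) FROM stage GROUP BY name;
    UPDATE stage2 SET name = (SELECT name FROM stage WHERE stage.id = stage2.id);

    -- Empty the original table and load the unique names back.
    DELETE FROM item;
    INSERT INTO item (name) SELECT name FROM stage2 ORDER BY id;

    DROP TABLE stage;
    DROP TABLE stage2;
""")

names = [row[0] for row in conn.execute("SELECT name FROM item")]
print(names)
```

After the script runs, item holds each name exactly once.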

If you look back at @yercube's answer and compare it to mine, his is much simpler because:

@yercube uses one temp table, while I use two
I had to create a column for iteration control, @yercube did not need to
@yercube has fewer steps
both answers achieve the same thing

I do not expect my answer to be accepted. The sole purpose of my answer was to demonstrate that other answers lose the concise clarity needed to solve your problem. Again, hats off to @yercube.
