MySQL – How to Select All Attributes of Duplicate Entries

group byMySQL

I have a database that containes user data and need to dig up only the entries where two attributes are duplicate between users (surname and social security number). I need all of the attributes for these entries, such as email and firstname.

I first used group by to get every user where those two values are duplicates with this query:

SELECT * 
FROM jasenet 
GROUP BY hetu, sukunimi 
HAVING count(*) >= 2 
ORDER BY sukunimi;

This returns all duplicate entries. After some research I tried to use IN operator in order to find out each individual duplicate person with this query:

SELECT * 
FROM jasenet 
WHERE sukunimi 
IN (SELECT sukunimi 
    FROM jasenet 
    GROUP BY hetu, sukunimi 
    HAVING count(*) >= 2 
    ORDER BY sukunimi)
ORDER BY sukunimi;

This works somewhat because I get each individual with duplicate surname, but in addition to this I need to be able to have the query use the social security number as well (hetu).

I tried this query, but it didn't work:

SELECT * 
FROM jasenet 
WHERE sukunimi 
IN (SELECT sukunimi 
    FROM jasenet 
    GROUP BY hetu, sukunimi 
    HAVING count(*) >= 2 
    ORDER BY sukunimi)
AND hetu
IN (SELECT sukunimi 
    FROM jasenet 
    GROUP BY hetu, sukunimi 
    HAVING count(*) >= 2 
    ORDER BY sukunimi)
ORDER BY sukunimi;

How can I use two columns returned from a query in another query?

Best Answer

There are a few way of doing this and depends on the size of your initial table.

Option 1:

/** This gives you all the surnames that are duplicated **/
Create Temporary Table tmpData
(surname CHAR(100) NOT NULL,
counter INT NOT NULL DEFAULT 1,
Primary Key (surname));
Insert Into tmpData
Select surname, count(1) as recs
From (jasenet)
Group BY 1
Having recs > 1;

/** This gives you all the SINS that are duplicated **/
Create Temporary Table tmpData2
(sin CHAR(100) NOT NULL,
counter INT NOT NULL DEFAULT 1,
Primary Key (surname));
Insert Into tmpData
Select sin, count(1) as recs
From (jasenet)
Group BY 1
Having recs > 1;

Select DISTINCT A.* 
From jasenet A, tmpData B, tmpData2 C
Where A.surname = B.surname
Or A.sin = C.sin;

Option 2 (Horribly slow -- never do)

Select A.userid, B.userid, 
    IF(A.surname = B.surname, 1, 0) as sameSurname,
    IF(A.sin = B.sin, 1, 0) as sameSIN
From (jasenet A, jasenet B)
Where A.userid != B.userid;

Option 3 -- Probably what you're looking for

Select A.surname, group_concat(A.userid) as users
From (jasenet A)
Group By 1
Having COUNT(A.surname) > 1;

Select A.sin, group_concat(A.userid) as users
From (jasenet A)
Group By 1
Having COUNT(A.sin) > 1;

So I gave a few differences but there are a few things I should mention. You are using GROUP BY without an aggregate function which in general means you can have non-deterministic results. Also it's just not how you use group by. What I tell my developers is when using group by it's as if you're building a new table with X fields as the PRIMARY KEY of that table.

Do not use #2 as it has to basically do 1 full scan for every userid, I just wanted to give it to you so you can see that you can double link a table (as long as it's not a temporary table).

Option 1 Is similar to #3 but uses temp tables which are extremely fast and I figured you can then use the temp tables to do your linking on another table. Just remember, temporary tables are gone once you CLOSE the session.

Let me know if this is not what you were looking for