Find duplicate records across one-to-many relationship tables

duplicationjoin;

Let's say I have two tables that have a one-to-many relationship: People and Attributes. I'm trying to find duplicates based on the first_name, last_name, and all attributes must match exactly.

CREATE TABLE People (Id int, 
                 first_name varchar(100), 
                 last_name varchar(100));

CREATE TABLE Attributes (Id int, 
                     person_id int, 
                     field varchar(100),
                     field_value varchar(100));

INSERT INTO People VALUES (1, 'John', 'Smith');
INSERT INTO People VALUES (2, 'John', 'Smith');
INSERT INTO People VALUES (3, 'John', 'Smith');

INSERT INTO Attributes VALUES (1, 1, 'HairColor', 'Brown');
INSERT INTO Attributes VALUES (2, 1, 'EyeColor', 'Blue');
INSERT INTO Attributes VALUES (3, 2, 'HairColor', 'Brown');
INSERT INTO Attributes VALUES (4, 2, 'EyeColor', 'Blue');
INSERT INTO Attributes VALUES (5, 3, 'HairColor', 'Blonde');

Which gives us:

 id | first_name | last_name
----+------------+-----------
  1 | John       | Smith
  2 | John       | Smith
  3 | John       | Smith

 id | person_id |   field   | field_value
----+-----------+-----------+-------------
  1 |         1 | HairColor | Brown
  2 |         1 | EyeColor  | Blue
  3 |         2 | HairColor | Brown
  4 |         2 | EyeColor  | Blue
  5 |         3 | HairColor | Blonde

I'd like a query that will return IDs 1 and 2 from the People table. I can find duplicates within a single table:

select first_name,last_name,count(*) from People 
    group by first_name,last_name having ( count(*) > 1 );

But I'm having trouble joining the one-to-many table and detecting duplicates across both tables. How can I detect duplicates across tables that have a one-to-many relationship?

Best Answer

One way to do this (check the SQLfiddle):

select 
    p1.id as id1, 
    p2.id as id2
from people p1
  join people p2
    on  p1.first_name = p2.first_name
    and p1.last_name = p2.last_name
    and p1.id < p2.id
where not exists
    ( select 1
      from 
      ( select * 
        from attributes a1
        where a1.person_id = p1.id
      union all
        select * 
        from attributes a2
        where a2.person_id = p2.id
      ) g
      group by field, field_value
      having count(*) <> 2
   ) ;

and another:

select 
    p1.id as id1, 
    p2.id as id2
from people p1
  join people p2
    on  p1.first_name = p2.first_name
    and p1.last_name = p2.last_name
    and p1.id < p2.id
where not exists
    ( ( select field, field_value
        from attributes a1
        where a1.person_id = p1.id
      union 
        select field, field_value
        from attributes a2
       where a2.person_id = p2.id
      ) 
    except
      ( select field, field_value
        from attributes a1
        where a1.person_id = p1.id
      intersect
        select field, field_value
        from attributes a2
        where a2.person_id = p2.id
      )
    ) ;

Intersect has higher precedence than except/minus, at least in Postgres and SQL Server. Just to be safe, you can use parentheses, to ensure the precedence.