Find duplicate records across one-to-many relationship tables

Let's say I have two tables with a one-to-many relationship: People and Attributes. I'm trying to find duplicate people, where two people count as duplicates only if their first_name, their last_name, and all of their attributes match exactly.

CREATE TABLE People (Id int, 
                 first_name varchar(100), 
                 last_name varchar(100));

CREATE TABLE Attributes (Id int, 
                     person_id int, 
                     field varchar(100),
                     field_value varchar(100));

INSERT INTO People VALUES (1, 'John', 'Smith');
INSERT INTO People VALUES (2, 'John', 'Smith');
INSERT INTO People VALUES (3, 'John', 'Smith');

INSERT INTO Attributes VALUES (1, 1, 'HairColor', 'Brown');
INSERT INTO Attributes VALUES (2, 1, 'EyeColor', 'Blue');
INSERT INTO Attributes VALUES (3, 2, 'HairColor', 'Brown');
INSERT INTO Attributes VALUES (4, 2, 'EyeColor', 'Blue');
INSERT INTO Attributes VALUES (5, 3, 'HairColor', 'Blonde');

Which gives us:

 id | first_name | last_name
----+------------+-----------
  1 | John       | Smith
  2 | John       | Smith
  3 | John       | Smith

 id | person_id |   field   | field_value
----+-----------+-----------+-------------
  1 |         1 | HairColor | Brown
  2 |         1 | EyeColor  | Blue
  3 |         2 | HairColor | Brown
  4 |         2 | EyeColor  | Blue
  5 |         3 | HairColor | Blonde

I'd like a query that will return IDs 1 and 2 from the People table. I can find duplicates within a single table:

select first_name,last_name,count(*) from People 
    group by first_name,last_name having ( count(*) > 1 );

But I'm having trouble joining the one-to-many table and detecting duplicates across both tables. How can I detect duplicates across tables that have a one-to-many relationship?

Best Answer

One way to do this (check the SQLfiddle):

select 
    p1.id as id1, 
    p2.id as id2
from people p1
  join people p2
    on  p1.first_name = p2.first_name
    and p1.last_name = p2.last_name
    and p1.id < p2.id
where not exists
    ( select 1
      from 
      ( select * 
        from attributes a1
        where a1.person_id = p1.id
      union all
        select * 
        from attributes a2
        where a2.person_id = p2.id
      ) g
      group by field, field_value
      having count(*) <> 2
   ) ;
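
To see why the HAVING clause does the job, it may help to run the inner grouping by hand for a single pair. The sketch below hard-codes person IDs 1 and 3 from the sample data (the alias cnt is only for illustration). It assumes, as the query above does, that no person has the same (field, field_value) pair twice.

```sql
-- Inner grouping of the query above, for the fixed pair (1, 3).
-- An attribute shared by both people appears twice in the UNION ALL,
-- an attribute held by only one of them appears once.
select field, field_value, count(*) as cnt
from
( select field, field_value
  from attributes
  where person_id = 1
union all
  select field, field_value
  from attributes
  where person_id = 3
) g
group by field, field_value;
```

For people 1 and 3 every group has a count of 1, so `having count(*) <> 2` finds rows and NOT EXISTS rejects the pair. For people 1 and 2 both groups have a count of 2, nothing survives the HAVING, and the pair is reported as a duplicate.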

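Another way, which checks that the symmetric difference of the two attribute sets (everything in the union but not in the intersection) is empty: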

select 
    p1.id as id1, 
    p2.id as id2
from people p1
  join people p2
    on  p1.first_name = p2.first_name
    and p1.last_name = p2.last_name
    and p1.id < p2.id
where not exists
    ( ( select field, field_value
        from attributes a1
        where a1.person_id = p1.id
      union 
        select field, field_value
        from attributes a2
        where a2.person_id = p2.id
      ) 
    except
      ( select field, field_value
        from attributes a1
        where a1.person_id = p1.id
      intersect
        select field, field_value
        from attributes a2
        where a2.person_id = p2.id
      )
    ) ;

INTERSECT has higher precedence than EXCEPT/MINUS, at least in Postgres and SQL Server. To be safe, use parentheses to make the precedence explicit.
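
Both queries return pairs (id1, id2). If a flat list of the duplicated People rows is wanted instead, as the question asks, one possible sketch is to move the pair test into an EXISTS. This is a variation on the first query, not a separately tested solution, and it keeps the same assumption that a person never has the same (field, field_value) pair twice.

```sql
-- Sketch: list every person that has at least one exact duplicate.
select p.id, p.first_name, p.last_name
from people p
where exists
    ( select 1
      from people q
      where q.first_name = p.first_name
        and q.last_name = p.last_name
        and q.id <> p.id              -- a person is not their own duplicate
        and not exists
            ( select 1
              from
              ( select field, field_value
                from attributes a1
                where a1.person_id = p.id
              union all
                select field, field_value
                from attributes a2
                where a2.person_id = q.id
              ) g
              group by field, field_value
              having count(*) <> 2    -- any attribute not shared by both
            )
    )
order by p.id;
```

With the sample data this should return the rows with IDs 1 and 2.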

Related Solutions

MySQL – get column from too many tables in MySQL

If all the tables use the MyISAM Storage Engine and have the same table structure, I have some good news for you.

You can create a single table that consumes no additional space except a .frm file and some mapping info. The key is to take advantage of the MERGE (MRG_MyISAM) Storage Engine.

Here is how you can do this:

CREATE TABLE XMerge LIKE X1;
ALTER TABLE XMerge ENGINE=MRG_MYISAM
UNION=(X1,X2,X3,X4) INSERT_METHOD=LAST;

Using this method, you can query the 4 tables at the same time like this:

SELECT ReqF FROM XMerge WHERE EmpName='John';

Was that simple, or what ???

In your case, you have 75 tables. You would do this:

CREATE TABLE XMerge LIKE X1;
ALTER TABLE XMerge ENGINE=MRG_MYISAM
UNION=(X1,X2,X3,X4,X5,X6,X7,X8,X9,
X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,
X20,X21,X22,X23,X24,X25,X26,X27,X28,X29,
X30,X31,X32,X33,X34,X35,X36,X37,X38,X39,
X40,X41,X42,X43,X44,X45,X46,X47,X48,X49,
X50,X51,X52,X53,X54,X55,X56,X57,X58,X59,
X60,X61,X62,X63,X64,X65,X66,X67,X68,X69,
X70,X71,X72,X73,X74,X75) INSERT_METHOD=LAST;
SELECT ReqF FROM XMerge WHERE EmpName='John';

The beauty of this is that creating a MERGE table takes milliseconds. Just make sure every table has an index on EmpName. Better to do 75 indexed lookups than 75 full table scans. If there is no index on EmpName, you need to do this:

ALTER TABLE X1 ADD UNIQUE KEY (EmpName);
ALTER TABLE X2 ADD UNIQUE KEY (EmpName);
.
.
.
ALTER TABLE X75 ADD UNIQUE KEY (EmpName);
CREATE TABLE XMerge LIKE X1;
ALTER TABLE XMerge ENGINE=MRG_MYISAM
UNION=(X1,X2,X3,X4,X5,X6,X7,X8,X9,
X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,
X20,X21,X22,X23,X24,X25,X26,X27,X28,X29,
X30,X31,X32,X33,X34,X35,X36,X37,X38,X39,
X40,X41,X42,X43,X44,X45,X46,X47,X48,X49,
X50,X51,X52,X53,X54,X55,X56,X57,X58,X59,
X60,X61,X62,X63,X64,X65,X66,X67,X68,X69,
X70,X71,X72,X73,X74,X75) INSERT_METHOD=LAST;
SELECT ReqF FROM XMerge WHERE EmpName='John';
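
Typing 75 table names by hand is error-prone, so it may be easier to let MySQL build the statements. The sketch below assumes the tables really are named X1 through X75 and live in the current schema. It only generates text; review the output before running it. It uses a plain ADD INDEX rather than the UNIQUE KEY shown above, because a unique key cannot be created if a name repeats within a table; keep UNIQUE KEY if names are known to be unique.

```sql
-- One ALTER TABLE ... ADD INDEX statement per underlying table.
SELECT CONCAT('ALTER TABLE ', table_name, ' ADD INDEX (EmpName);') AS index_statement
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name REGEXP '^X[0-9]+$'
ORDER BY CAST(SUBSTRING(table_name, 2) AS UNSIGNED);

-- The MERGE definition, with the UNION list in numeric order.
SELECT CONCAT(
         'ALTER TABLE XMerge ENGINE=MRG_MYISAM UNION=(',
         GROUP_CONCAT(table_name
                      ORDER BY CAST(SUBSTRING(table_name, 2) AS UNSIGNED)),
         ') INSERT_METHOD=LAST;'
       ) AS merge_statement
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name REGEXP '^X[0-9]+$';
```

The REGEXP filter keeps XMerge itself out of the list, since its name does not consist of X followed only by digits.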

Give it a Try !!!

Join the two tables which possess one to many relationship

The relationship appears to be between content_field_mem_id.field_mem_id_value and ex.mem_id. However, since content_field_mem_id.field_mem_id_value is an int and ex.mem_id is a varchar, you will need to cast/convert one of the columns as part of your join.

It would be more reliable (although slower) to convert the int column to a varchar, since converting the varchar column to an int can fail or mismatch when a value is not purely numeric.

You didn't say which DB technology you are using, so I can't show the exact syntax for the convert (it is very different between MS SQL/Oracle/DB2/etc).
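
As a rough illustration only, here is what the join could look like with the standard CAST syntax, which PostgreSQL and SQL Server accept. The select list and the length 20 are placeholders, and "ex" is assumed to be the table (or alias) holding mem_id, as in the question.

```sql
-- Sketch: convert the int column to a string so both sides of the join
-- have the same type. Adjust the CAST target for your database.
SELECT c.field_mem_id_value, ex.mem_id
FROM content_field_mem_id c
JOIN ex
  ON CAST(c.field_mem_id_value AS VARCHAR(20)) = ex.mem_id;

-- MySQL does not accept VARCHAR as a CAST target; there the condition
-- would be written as:
--   ON CAST(c.field_mem_id_value AS CHAR) = ex.mem_id
```

Note that a string comparison treats '7' and '007' as different values, so this only matches when mem_id stores the number without padding or leading zeros.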
