Sql-server – merge like rows then update fk from another table to match the new values

duplicationjoin;sql servert-sql

I have 2 tables tb1 (id, name, record_no, location,…) tb2(id, test, date,…)

Tb1 is joined to tb2 on tb1.id = tb2.id

The issue is that TB1 has duplicated entry's that have new ids. I need to merge these ids into one 1 for each unique entry then update tb2.id to match the change.

Not sure of the most effective way to do this without having to manually update each row.

Before

SELECT * FROM TB1

ID, NAME,    DOB,        RECORD_NUM
1, John Doe, 01/01/1900, 123456789
2, John Doe, 01/01/1900, 123456789
3, Jane Doe, 11/03/2016, 294018400
4, Jane Doe, 11/03/2016, 294018400
...

SELECT * FROM TB2

ID, Test,    Result, Date
1,  English, Pass,   01/01/1900
1,  Grammer, Fail,   01/02/1900
2,  Gym,     Pass,   01/01/1900
3,  Art,     Pass,   11/02/2016
4,  Gym,     Pass,   11/03/2016
...

Basically I need to take row ID 2 and merge it will ID 1 from TB1 then where 2 appears in TB2 I need to update it to 1.

I know the entries are the same usually by the Record_num or if that is a null value I can use the name and dob (since together they should be unique in the set).

After

SELECT * FROM TB1

ID, NAME,    DOB,        RECORD_NUM
1, John Doe, 01/01/1900, 123456789
3, Jane Doe, 11/03/2016, 294018400
...

SELECT * FROM TB2

ID, Test,    Result, Date
1,  English, Pass,   01/01/1900
1,  Grammer, Fail,   01/02/1900
1,  Gym,     Pass,   01/01/1900
3,  Art,     Pass,   11/02/2016
3,  Gym,     Pass,   11/03/2016
...

I hope this helps explain a little more.

Best Answer

You could update the second table first, then delete the duplicate (and now unreferenced) rows from the first table.

The (PARTITION BY name, dob, record_num) is what identifies rows as duplicates. If more or less columns are needed to identify then, adjust accordingly.

It would be good to put the two statements in a transaction to avoid weird effects / errors if other sessions access the table (inserting new rows or deleting between the 2 statements may result in the 2nd one to fail or having unreferenced rows in the end:

WITH ids AS
( SELECT dup_id = id,
         good_id = MIN(id) OVER (PARTITION BY name, dob, record_num) 
  FROM tb1 
) 
UPDATE t2
SET t2.id = i.good_id
FROM tb2 AS t2 
  JOIN ids AS i
    ON i.dup_id = t2.id
WHERE i.dup_id <> i.good_id ;


WITH ids AS
( SELECT dup_id = id,
         good_id = MIN(id) OVER (PARTITION BY name, dob, record_num) 
  FROM tb1 
) 
DELETE d
FROM tb1 AS d
  JOIN ids AS i
    ON i.dup_id = d.id
WHERE i.dup_id <> i.good_id ; 

Tested a rextester.com


The 2nd statement could have been written more simply but I find the above 1st way slightly more readable as the 2 statements have almost identical FROM and WHERE clauses.

DELETE i
FROM 
      ( SELECT dup_id = id,
               good_id = MIN(id) OVER (PARTITION BY name, dob, record_num) 
        FROM tb1 
      )               -- the ids CTE rewritten as a derived table
      AS i
WHERE i.dup_id <> i.good_id ;