Sql-server – merge like rows then update fk from another table to match the new values

Tags: duplication; join; sql-server; t-sql

I have two tables: tb1 (id, name, record_no, location, …) and tb2 (id, test, date, …).

Tb1 is joined to tb2 on tb1.id = tb2.id

The issue is that TB1 has duplicated entries that were given new ids. I need to merge these ids into one for each unique entry, then update tb2.id to match the change.

I'm not sure of the most effective way to do this without having to update each row manually.

Before

SELECT * FROM TB1

ID, NAME,    DOB,        RECORD_NUM
1, John Doe, 01/01/1900, 123456789
2, John Doe, 01/01/1900, 123456789
3, Jane Doe, 11/03/2016, 294018400
4, Jane Doe, 11/03/2016, 294018400
...

SELECT * FROM TB2

ID, Test,    Result, Date
1,  English, Pass,   01/01/1900
1,  Grammer, Fail,   01/02/1900
2,  Gym,     Pass,   01/01/1900
3,  Art,     Pass,   11/02/2016
4,  Gym,     Pass,   11/03/2016
...

Basically, I need to take row ID 2 and merge it with ID 1 in TB1; then, wherever 2 appears in TB2, I need to update it to 1.

I can usually tell that entries are the same by the Record_num; if that is NULL, I can use the name and dob (since together they should be unique in the set).

After

SELECT * FROM TB1

ID, NAME,    DOB,        RECORD_NUM
1, John Doe, 01/01/1900, 123456789
3, Jane Doe, 11/03/2016, 294018400
...

SELECT * FROM TB2

ID, Test,    Result, Date
1,  English, Pass,   01/01/1900
1,  Grammer, Fail,   01/02/1900
1,  Gym,     Pass,   01/01/1900
3,  Art,     Pass,   11/02/2016
3,  Gym,     Pass,   11/03/2016
...

I hope this helps explain a little more.

Best Answer

You could update the second table first, then delete the duplicate (and now unreferenced) rows from the first table.

The (PARTITION BY name, dob, record_num) is what identifies rows as duplicates. If more or fewer columns are needed to identify them, adjust accordingly.
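Before changing anything, it can help to preview which ids the partitioning would treat as duplicates. The following is only a minimal sketch and not part of the original answer: the table variable @tb1, its column types, and the reading of the sample dates as month/day/year are assumptions based on the sample rows in the question.

```sql
-- Hypothetical stand-in for tb1, loaded with the question's sample rows.
DECLARE @tb1 TABLE
( id         int         PRIMARY KEY,
  name       varchar(50) NOT NULL,
  dob        date        NOT NULL,
  record_num varchar(20) NULL
);

INSERT INTO @tb1 (id, name, dob, record_num)
VALUES (1, 'John Doe', '19000101', '123456789'),
       (2, 'John Doe', '19000101', '123456789'),
       (3, 'Jane Doe', '20161103', '294018400'),
       (4, 'Jane Doe', '20161103', '294018400');

-- Same CTE as in the answer, used read-only:
-- list every id that would be replaced and the id that would replace it.
WITH ids AS
( SELECT dup_id  = id,
         good_id = MIN(id) OVER (PARTITION BY name, dob, record_num)
  FROM @tb1
)
SELECT dup_id, good_id
FROM ids
WHERE dup_id <> good_id
ORDER BY dup_id;
```

With this sample data the query lists id 2 (to be folded into 1) and id 4 (to be folded into 3). PARTITION BY puts NULLs into the same group, so rows with the same name and dob and a NULL record_num are grouped together, but a row with a NULL record_num is not grouped with a row that has a value there.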

It would be good to put the two statements in a transaction to avoid odd effects or errors if other sessions access the tables (rows inserted or deleted between the two statements may cause the second one to fail or leave unreferenced rows at the end):

WITH ids AS
( SELECT dup_id = id,
         good_id = MIN(id) OVER (PARTITION BY name, dob, record_num) 
  FROM tb1 
) 
UPDATE t2
SET t2.id = i.good_id
FROM tb2 AS t2 
  JOIN ids AS i
    ON i.dup_id = t2.id
WHERE i.dup_id <> i.good_id ;


WITH ids AS
( SELECT dup_id = id,
         good_id = MIN(id) OVER (PARTITION BY name, dob, record_num) 
  FROM tb1 
) 
DELETE d
FROM tb1 AS d
  JOIN ids AS i
    ON i.dup_id = d.id
WHERE i.dup_id <> i.good_id ;

Tested at rextester.com

The 2nd statement could have been written more simply, as shown below, but I find the first form slightly more readable because the two statements then have almost identical FROM and WHERE clauses.

DELETE i
FROM 
      ( SELECT dup_id = id,
               good_id = MIN(id) OVER (PARTITION BY name, dob, record_num) 
        FROM tb1 
      )               -- the ids CTE rewritten as a derived table
      AS i
WHERE i.dup_id <> i.good_id ;
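The transaction suggested earlier in this answer could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's tested code: the temp tables #tb1 and #tb2, their column types, and the choice of SET XACT_ABORT ON are mine, and tb2 is reduced to three columns to keep the example short.

```sql
-- Hypothetical temp-table copies of tb1 and tb2 so the sketch runs on its own.
CREATE TABLE #tb1 (id int PRIMARY KEY, name varchar(50), dob date, record_num varchar(20));
CREATE TABLE #tb2 (id int NOT NULL, test varchar(20), result varchar(10));

INSERT INTO #tb1 (id, name, dob, record_num)
VALUES (1, 'John Doe', '19000101', '123456789'),
       (2, 'John Doe', '19000101', '123456789'),
       (3, 'Jane Doe', '20161103', '294018400'),
       (4, 'Jane Doe', '20161103', '294018400');

INSERT INTO #tb2 (id, test, result)
VALUES (1, 'English', 'Pass'), (1, 'Grammer', 'Fail'), (2, 'Gym', 'Pass'),
       (3, 'Art', 'Pass'),     (4, 'Gym', 'Pass');

SET XACT_ABORT ON;   -- a run-time error rolls the whole transaction back

BEGIN TRANSACTION;

-- Step 1: repoint child rows from duplicate ids to the surviving id.
WITH ids AS
( SELECT dup_id  = id,
         good_id = MIN(id) OVER (PARTITION BY name, dob, record_num)
  FROM #tb1
)
UPDATE t2
SET t2.id = i.good_id
FROM #tb2 AS t2
  JOIN ids AS i
    ON i.dup_id = t2.id
WHERE i.dup_id <> i.good_id ;

-- Step 2: delete the duplicate parent rows, which are now unreferenced.
WITH ids AS
( SELECT dup_id  = id,
         good_id = MIN(id) OVER (PARTITION BY name, dob, record_num)
  FROM #tb1
)
DELETE d
FROM #tb1 AS d
  JOIN ids AS i
    ON i.dup_id = d.id
WHERE i.dup_id <> i.good_id ;

COMMIT TRANSACTION;

SELECT id, name FROM #tb1 ORDER BY id;        -- ids 1 and 3 remain
SELECT id, test FROM #tb2 ORDER BY id, test;  -- ids are now 1, 1, 1, 3, 3

DROP TABLE #tb1, #tb2;
```

Under the default READ COMMITTED isolation level, a transaction by itself does not stop other sessions from inserting new rows into tb1 between the two statements. Whether a stricter isolation level or a table lock is worth the blocking depends on how busy the tables are.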

Related Solutions

Sql-server – T-SQL Check another table rows if match delete them

I can't say anything about SSIS, but in SQL you can check whether a whole row is identical to another (including NULL values, which can otherwise get rather complicated) using the technique explained by @PaulWhite in his blog post Undocumented Query Plans: Equality Comparisons.

For example, in your case, "for any identical rows of table B and table A, delete those B rows":

DELETE b                        -- from table B: Customer_information
FROM Customer_archive AS a
  JOIN Customer_information AS b
  ON a.pk = b.pk
WHERE EXISTS (SELECT a.* INTERSECT SELECT b.*) ;

I do have serious concerns about efficiency when the tables are big, and an archive table is by definition going to be quite big. The a.pk = b.pk condition is not strictly needed, as the pk columns are already included in the row check of the EXISTS, but I kept it for efficiency. Assuming that the two tables have the same primary key and that, after a while, most of the archived rows have already been deleted from table B, the PK indexes will have very few matching values, so the join will be relatively fast and the row check runs only for rows with matching pk values.
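To show why the INTERSECT form matters when columns can be NULL, here is a small sketch that compares it with a plain equality test. The table variables @a and @b and the note column are hypothetical. The columns are listed explicitly instead of using a.* and b.*, which is a stylistic choice for the example and not a requirement of the technique.

```sql
-- Hypothetical miniature versions of the archive (a) and live (b) tables.
DECLARE @a TABLE (pk int PRIMARY KEY, note varchar(20) NULL);
DECLARE @b TABLE (pk int PRIMARY KEY, note varchar(20) NULL);

INSERT INTO @a (pk, note) VALUES (1, NULL), (2, 'same'), (3, 'old');
INSERT INTO @b (pk, note) VALUES (1, NULL), (2, 'same'), (3, 'new');

-- Plain equality: NULL = NULL is not true, so pk 1 is missed.
SELECT b.pk
FROM @a AS a
  JOIN @b AS b
  ON a.pk = b.pk
WHERE a.note = b.note ;          -- returns pk 2 only

-- INTERSECT treats two NULLs as matching, so pk 1 is found as well.
SELECT b.pk
FROM @a AS a
  JOIN @b AS b
  ON a.pk = b.pk
WHERE EXISTS (SELECT a.pk, a.note INTERSECT SELECT b.pk, b.note) ;   -- returns pk 1 and 2
```

Replacing the second SELECT b.pk with DELETE b gives the same shape as the DELETE statement above.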

Sql-server – Update table using values from another table in SQL Server

There are quite a few ways to achieve your desired results.

Non-deterministic methods

(in the event that many rows in table 2 match one in table 1)

UPDATE T1
SET    address = T2.address,
       phone2 = T2.phone
FROM   #Table1 T1
       JOIN #Table2 T2
         ON T1.gender = T2.gender
            AND T1.birthdate = T2.birthdate

Or a slightly more concise form

UPDATE #Table1
SET    address = #Table2.address,
       phone2 = #Table2.phone
FROM   #Table2
WHERE  #Table2.gender = #Table1.gender
       AND #Table2.birthdate = #Table1.birthdate

Or with a CTE

WITH CTE
     AS (SELECT T1.address AS tgt_address,
                T1.phone2  AS tgt_phone,
                T2.address AS source_address,
                T2.phone   AS source_phone
         FROM   #Table1 T1
                INNER JOIN #Table2 T2
                  ON T1.gender = T2.gender
                     AND T1.birthdate = T2.birthdate)
UPDATE CTE
SET    tgt_address = source_address,
       tgt_phone = source_phone

Deterministic methods

MERGE would throw an error rather than accept non-deterministic results

MERGE #Table1 T1
USING #Table2 T2
ON T1.gender = T2.gender
   AND T1.birthdate = T2.birthdate
WHEN MATCHED THEN
  UPDATE SET address = T2.address,
             phone2 = T2.phone;

Or you could pick a specific record if there is more than one match

With APPLY

UPDATE T1
SET    address = T2.address,
       phone2 = T2.phone
FROM   #Table1 T1
       CROSS APPLY (SELECT TOP 1 *
                    FROM   #Table2 T2
                    WHERE  T1.gender = T2.gender
                           AND T1.birthdate = T2.birthdate
                    ORDER  BY T2.PrimaryKey) T2

Or with a CTE

WITH T2
     AS (SELECT *,
                ROW_NUMBER() OVER (PARTITION BY gender, birthdate ORDER BY primarykey) AS RN
         FROM   #Table2)
UPDATE T1
SET    address = T2.address,
       phone2 = T2.phone
FROM   #Table1 T1
       JOIN T2
         ON T1.gender = T2.gender
            AND T1.birthdate = T2.birthdate
            AND T2.RN = 1;
