SQL Server – Merge and Update Multiple Rows Across Columns

Tags: join; merge; sql-server; update; upsert

I am trying to consolidate data in one table and merge it into another.

The data in the destination table is as follows:

name                |dob                        |city   |occupation
-----------------------------------------------------------------------------------
galileo-galilei     |1900-01-01 00:00:00.000    |rome   |polymath
issac-newton        |1900-01-01 00:00:00.000    |london |mathematician-scientist
leonardo-da-cinci   |1900-01-03 00:00:00.000    |rome   |polymath

The data in the source table is:

sl_no   |name               |dob                        |city   |occupation
-----------------------------------------------------------------------------
1       |galileo-galilei    |1900-01-01 00:00:00.000    |       |
2       |galileo-galilei    |1900-01-02 00:00:00.000    |venice |
3       |galileo-galilei    |1900-01-05 00:00:00.000    |       |astronomer

The expected result in the destination table is :

name                |dob                        |city   |occupation
-----------------------------------------------------------------------------------
galileo-galilei     |1900-01-05 00:00:00.000    |venice |astronomer
issac-newton        |1900-01-01 00:00:00.000    |london |mathematician-scientist
leonardo-da-cinci   |1900-01-03 00:00:00.000    |rome   |polymath

My attempts using update-with-join and merge have not been successful.

Update-With-Join :

-- updates data from the first match only
update p
set p.city = s.city,
p.occupation = s.occupation
from person_update_with_join_test_primary p, person_update_with_join_test_secondary s
where p.name = s.name ;

Merge :

-- https://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
/*
The MERGE statement attempted to UPDATE or DELETE the same row more than once. 
This happens when a target row matches more than one source row. 
A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. 
Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
*/
begin
merge person_update_with_join_test_primary as p
using person_update_with_join_test_secondary as s
on (p.name = s.name)
when not matched by target 
then insert (name, dob, city, occupation) 
values (s.name, s.dob, s.city, s.occupation)
when matched 
then update set p.dob = s.dob 
, p.city=(case when (len(s.city)>0) then  s.city else p.city end)
, p.occupation=(case when (len(s.occupation)>0) then  s.occupation else p.occupation end)
output $action, inserted.*, deleted.*;
end

I believe what I am looking for is similar to what has been posted here and here. However, it is not quite what I am looking for.

Are there any other ways to accomplish this other than using a cursor and an upsert (assuming that works)?

Update #1 :

Basically, as long as the values in the source are not empty, the most recent values (those with the highest sl_no) in the source are expected to be merged into the destination.

E.g.: for row #3 in the source, the city column would not be considered for merging into the destination. Similarly, for row #2, the occupation column would not be considered for merging into the destination.
The column name is the primary key in the destination table.

I am trying to achieve the same state in the destination table as would have been expected if I were to iterate over the source data and update only the non-empty values in the destination – by using a query instead of doing it through application.

Best Answer

If you only ever have one row where the city and occupation columns are populated, you can achieve it with a windowing function:

E.g.:

DECLARE @Source TABLE(
    sl_no       INT
    ,name       NVARCHAR(30)
    ,dob        DATETIME2(3)
    ,city       NVARCHAR(30)
    ,occupation NVARCHAR(30)
);

INSERT INTO @Source
VALUES
    (1, 'galileo-galilei', '1900-01-01 00:00:00.000', NULL, NULL),
    (2, 'galileo-galilei', '1900-01-02 00:00:00.000', 'venice', NULL),
    (3, 'galileo-galilei', '1900-01-05 00:00:00.000', NULL, 'astronomer'),
    (4, 'issac-newton',    '1900-01-01 00:00:00.000', 'london', 'mathematician-scientist')

SELECT DISTINCT
    name
    ,MAX(dob)           OVER(PARTITION BY name) AS dob
    ,MAX(city)          OVER(PARTITION BY name) AS city
    ,MAX(occupation)    OVER(PARTITION BY name) AS occupation
FROM 
    @Source
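
The query above works because MAX skips NULLs, so for each name it returns whichever value is populated. As a sketch of the same idea (not part of the original answer), a plain GROUP BY should give the same rows without needing DISTINCT. Note that MAX on a text column returns the alphabetically greatest value, not the most recent one, which is why this only fits the "one populated row per name" case.

```sql
-- Same sample data as above, repeated so the sketch runs on its own.
DECLARE @Source TABLE(
    sl_no       INT
    ,name       NVARCHAR(30)
    ,dob        DATETIME2(3)
    ,city       NVARCHAR(30)
    ,occupation NVARCHAR(30)
);

INSERT INTO @Source (sl_no, name, dob, city, occupation)
VALUES
    (1, N'galileo-galilei', '1900-01-01 00:00:00.000', NULL, NULL),
    (2, N'galileo-galilei', '1900-01-02 00:00:00.000', N'venice', NULL),
    (3, N'galileo-galilei', '1900-01-05 00:00:00.000', NULL, N'astronomer'),
    (4, N'issac-newton',    '1900-01-01 00:00:00.000', N'london', N'mathematician-scientist');

-- One row per name; each MAX ignores the NULLs in its column.
SELECT
    name
    ,MAX(dob)        AS dob
    ,MAX(city)       AS city
    ,MAX(occupation) AS occupation
FROM
    @Source
GROUP BY
    name;
```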

However, I suspect the reality is you could have multiple records and you always want to return the value from the most recent record that has data in those columns. E.g. if your source was:

DECLARE @Source TABLE(
    sl_no       INT
    ,name       NVARCHAR(30)
    ,dob        DATETIME2(3)
    ,city       NVARCHAR(30)
    ,occupation NVARCHAR(30)
);

INSERT INTO @Source
VALUES
    (1, 'galileo-galilei', '1900-01-01 00:00:00.000', 'rome', NULL),
    (2, 'galileo-galilei', '1900-01-02 00:00:00.000', 'venice', NULL),
    (3, 'galileo-galilei', '1900-01-05 00:00:00.000', NULL, 'astronomer'),
    (4, 'issac-newton',    '1900-01-01 00:00:00.000', 'london', 'mathematician-scientist')

You could achieve what you want with:

SELECT
    s.name
    ,s.dob
    ,sc.city
    ,so.occupation
FROM
    @Source AS s
    CROSS APPLY(
        SELECT TOP 1 city
        FROM @Source AS s2
        WHERE s2.name = s.name
        AND city IS NOT NULL
        ORDER BY sl_no DESC
        ) AS sc
    CROSS APPLY(
        SELECT TOP 1 occupation
        FROM @Source AS s3
        WHERE s3.name = s.name
        AND occupation IS NOT NULL
        ORDER BY sl_no DESC
    ) AS so
WHERE
    s.sl_no = (SELECT MAX(sl_no) FROM @Source AS s4 WHERE s4.name = s.name)

Wrap that up into a merge or update (I'll do a merge for you) and you will get:

WITH src AS (
    SELECT
        s.name
        ,s.dob
        ,sc.city
        ,so.occupation
    FROM
        @Source AS s
        CROSS APPLY(
            SELECT TOP 1 city
            FROM @Source AS s2
            WHERE s2.name = s.name
            AND city IS NOT NULL
            ORDER BY sl_no DESC
            ) AS sc
        CROSS APPLY(
            SELECT TOP 1 occupation
            FROM @Source AS s3
            WHERE s3.name = s.name
            AND occupation IS NOT NULL
            ORDER BY sl_no DESC
        ) AS so
    WHERE
        s.sl_no = (SELECT MAX(sl_no) FROM @Source AS s4 WHERE s4.name = s.name)
)
MERGE INTO Destination AS tgt
USING src
    ON tgt.name = src.name

WHEN MATCHED THEN UPDATE
SET dob = src.dob
    ,city = src.city
    ,occupation = src.occupation

WHEN NOT MATCHED THEN INSERT(name, dob, city, occupation)
VALUES(src.name, src.dob, src.city, src.occupation);
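
One caveat with the CROSS APPLY form: a name with no non-NULL city (or occupation) in any source row drops out of the result entirely, and the question stores blanks as empty strings rather than NULL. The following is a hedged variant, not the original answer's code. It assumes empty strings mean "no value", uses OUTER APPLY so such names are kept, and uses COALESCE so a missing source value leaves the destination value alone. The temp tables #Source and #Destination are hypothetical stand-ins for the real tables, and source row 4 is made-up data to exercise the "nothing to merge" case.

```sql
-- Hypothetical stand-ins for the real tables, so the sketch runs on its own.
CREATE TABLE #Destination (
    name        NVARCHAR(30) NOT NULL PRIMARY KEY,
    dob         DATETIME2(3) NULL,
    city        NVARCHAR(30) NULL,
    occupation  NVARCHAR(30) NULL
);
CREATE TABLE #Source (
    sl_no       INT NOT NULL PRIMARY KEY,
    name        NVARCHAR(30) NOT NULL,
    dob         DATETIME2(3) NULL,
    city        NVARCHAR(30) NULL,
    occupation  NVARCHAR(30) NULL
);

INSERT INTO #Destination (name, dob, city, occupation)
VALUES
    (N'galileo-galilei',   '1900-01-01', N'rome',   N'polymath'),
    (N'issac-newton',      '1900-01-01', N'london', N'mathematician-scientist'),
    (N'leonardo-da-cinci', '1900-01-03', N'rome',   N'polymath');

INSERT INTO #Source (sl_no, name, dob, city, occupation)
VALUES
    (1, N'galileo-galilei', '1900-01-01', N'',       N''),
    (2, N'galileo-galilei', '1900-01-02', N'venice', N''),
    (3, N'galileo-galilei', '1900-01-05', N'',       N'astronomer'),
    (4, N'issac-newton',    '1900-01-01', N'',       N''); -- hypothetical row: no city or occupation

WITH src AS (
    SELECT s.name, s.dob, sc.city, so.occupation
    FROM #Source AS s
    OUTER APPLY (
        -- latest non-empty city for this name (<> N'' also filters out NULL)
        SELECT TOP (1) s2.city
        FROM #Source AS s2
        WHERE s2.name = s.name AND s2.city <> N''
        ORDER BY s2.sl_no DESC
    ) AS sc
    OUTER APPLY (
        -- latest non-empty occupation for this name
        SELECT TOP (1) s3.occupation
        FROM #Source AS s3
        WHERE s3.name = s.name AND s3.occupation <> N''
        ORDER BY s3.sl_no DESC
    ) AS so
    -- keep only the most recent source row per name
    WHERE s.sl_no = (SELECT MAX(s4.sl_no) FROM #Source AS s4 WHERE s4.name = s.name)
)
MERGE INTO #Destination AS tgt
USING src
    ON tgt.name = src.name
WHEN MATCHED THEN UPDATE
    SET dob        = src.dob
       ,city       = COALESCE(src.city, tgt.city)             -- keep existing value if the source has none
       ,occupation = COALESCE(src.occupation, tgt.occupation)
WHEN NOT MATCHED BY TARGET THEN
    INSERT (name, dob, city, occupation)
    VALUES (src.name, src.dob, src.city, src.occupation);

SELECT name, dob, city, occupation FROM #Destination ORDER BY name;

DROP TABLE #Source, #Destination;
```

With this data the galileo-galilei row ends up with dob 1900-01-05, city venice and occupation astronomer, matching the expected result in the question, while the other two rows keep their city and occupation.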

You will want an index on the column you join on (name in the above examples) for performance. Otherwise each lookup against the source has to scan it.
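
As an illustration of that indexing advice, here is a minimal sketch. It assumes the source lives in a regular or temp table rather than a table variable; #Source and the index name are hypothetical. Keying on (name, sl_no DESC) should let each "latest row for this name" lookup seek rather than scan, but check the execution plan against your own data.

```sql
-- Hypothetical temp table standing in for the real source table.
CREATE TABLE #Source (
    sl_no       INT NOT NULL PRIMARY KEY,
    name        NVARCHAR(30) NOT NULL,
    dob         DATETIME2(3) NULL,
    city        NVARCHAR(30) NULL,
    occupation  NVARCHAR(30) NULL
);

-- Rows for one name are stored together, newest sl_no first;
-- the INCLUDE columns let the lookups be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_Source_name_sl_no
    ON #Source (name, sl_no DESC)
    INCLUDE (dob, city, occupation);

DROP TABLE #Source;
```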

Related Solutions

Sql-server – How to delete only related records in a multi-key MERGE in SQL Server

This is the separate DELETE operation I had in mind:

DELETE m
FROM dbo.Mapping AS m
WHERE EXISTS 
  (SELECT 1 FROM @Values WHERE LeftID = m.LeftID)
AND NOT EXISTS 
  (SELECT 1 FROM @Values WHERE LeftID = m.LeftID AND RightID = m.RightID);

As I outline here, for a left anti-semi join, the NOT EXISTS pattern will often outperform the LEFT JOIN / NULL pattern (but you should always test).

Not sure if your overall goal is clarity or performance, so only you can judge if this will work out better for your requirements than the NOT MATCHED BY source option. You'll have to look at the plans qualitatively, and the plans and/or runtime metrics quantitatively, to know for sure.

If you expect your MERGE command to protect you from race conditions that would happen with multiple independent statements, you had better make sure that is true by changing it to:

MERGE dbo.Mapping WITH (HOLDLOCK) AS target

(From Dan Guzman's blog post.)

Personally, I would do all of this without MERGE, because there are unresolved bugs, among other reasons. And Paul White seems to recommend separate DML statements as well.
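
A sketch of what the separate-statement approach could look like, under stated assumptions: #Mapping is a hypothetical stand-in for dbo.Mapping with only the two key columns, and the sample rows are made up. The locking hints on the existence check are a common pattern for closing the insert race on a shared table; on a session-local temp table they have no practical effect and are shown only for shape.

```sql
-- Hypothetical stand-in for dbo.Mapping, with made-up rows.
CREATE TABLE #Mapping (
    LeftID  INT NOT NULL,
    RightID INT NOT NULL,
    PRIMARY KEY (LeftID, RightID)
);
INSERT INTO #Mapping (LeftID, RightID) VALUES (1, 10), (1, 11), (2, 20);

-- Incoming pairs: LeftID 1 should end up mapped to 11 and 12 only.
DECLARE @Values TABLE (LeftID INT NOT NULL, RightID INT NOT NULL);
INSERT INTO @Values (LeftID, RightID) VALUES (1, 11), (1, 12);

SET XACT_ABORT ON;
BEGIN TRANSACTION;

-- Delete pairs for the incoming LeftIDs that are no longer wanted;
-- rows for other LeftIDs (here 2) are left untouched.
DELETE m
FROM #Mapping AS m
WHERE EXISTS
  (SELECT 1 FROM @Values AS v WHERE v.LeftID = m.LeftID)
AND NOT EXISTS
  (SELECT 1 FROM @Values AS v WHERE v.LeftID = m.LeftID AND v.RightID = m.RightID);

-- Insert the pairs that are not there yet.
INSERT INTO #Mapping (LeftID, RightID)
SELECT v.LeftID, v.RightID
FROM @Values AS v
WHERE NOT EXISTS
  (SELECT 1
   FROM #Mapping AS m WITH (UPDLOCK, HOLDLOCK)
   WHERE m.LeftID = v.LeftID AND m.RightID = v.RightID);

COMMIT TRANSACTION;

SELECT LeftID, RightID FROM #Mapping ORDER BY LeftID, RightID;

DROP TABLE #Mapping;
```

With these rows the final contents are (1, 11), (1, 12) and (2, 20).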

And here's why I added a schema prefix: you should always reference objects by schema, when creating, affecting, etc.

How to Avoid Using Merge Query When Upserting Multiple Data with XML Parameter

Whether the source is XML or a TVP does not make a huge difference. The overall operation is essentially:

UPDATE existing rows
INSERT missing rows

You do it in that order because if you INSERT first, every row then exists when the UPDATE runs, so you repeat work for any rows that were just inserted.

Beyond that there are different ways to accomplish this and various ways to tweak some additional efficiency out of it.

Let's start with the bare minimum. Since extracting the XML is likely to be one of the more expensive parts of this operation (if not the most expensive), we don't want to have to do that twice (as we have two operations to perform). So, we create a temp table and extract the data out of the XML into it:

CREATE TABLE #TempImport
(
  Field1 DataType1,
  Field2 DataType2,
  ...
);

INSERT INTO #TempImport (Field1, Field2, ...)
  SELECT tab.col.value('XQueryForField1', 'DataType') AS [Field1],
         tab.col.value('XQueryForField2', 'DataType') AS [Field2],
         ...
  FROM   @XmlInputParam.nodes('XQuery') tab(col);
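
To make the extraction step concrete, here is a minimal sketch with the placeholders filled in. The XML shape (a rows element containing row elements with id and name attributes) and the columns IDField and Name are assumptions for illustration only; the real XQuery paths and data types depend on your document.

```sql
-- Hypothetical input; the real parameter would come from the caller.
DECLARE @XmlInputParam XML = N'<rows>
  <row id="1" name="alpha" />
  <row id="2" name="beta" />
</rows>';

CREATE TABLE #TempImport
(
  IDField INT NOT NULL,
  Name    NVARCHAR(50) NULL
);

-- nodes() yields one row per <row> element; value() reads an attribute from it.
INSERT INTO #TempImport (IDField, Name)
  SELECT tab.col.value('@id', 'INT')            AS [IDField],
         tab.col.value('@name', 'NVARCHAR(50)') AS [Name]
  FROM   @XmlInputParam.nodes('/rows/row') tab(col);

SELECT IDField, Name FROM #TempImport ORDER BY IDField;

DROP TABLE #TempImport;
```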

From there we do the UPDATE and then the INSERT:

UPDATE tab
SET    tab.Field1 = tmp.Field1,
       tab.Field2 = tmp.Field2,
       ...
FROM   [SchemaName].[TableName] tab
INNER JOIN #TempImport tmp
        ON tmp.IDField = tab.IDField
        ... -- more fields if PK or alternate key is composite

INSERT INTO [SchemaName].[TableName]
  (Field1, Field2, ...)
  SELECT tmp.Field1, tmp.Field2, ...
  FROM   #TempImport tmp
  WHERE  NOT EXISTS (
                       SELECT  *
                       FROM    [SchemaName].[TableName] tab
                       WHERE   tab.IDField = tmp.IDField
                       ... -- more fields if PK or alternate key is composite
                     );

Now that we have the basic operation down, we can do a few things to optimize:

Capture @@ROWCOUNT of the insert into the temp table and compare it to @@ROWCOUNT of the UPDATE. If they are the same, then we can skip the INSERT.
Capture the ID values updated via the OUTPUT clause and DELETE those from the temp table. Then the INSERT doesn't need the WHERE NOT EXISTS(...).
If there are any rows in the incoming data that should not be synced (i.e. neither inserted nor updated), then those records should be removed prior to doing the UPDATE.

CREATE TABLE #TempImport
(
  Field1 DataType1,
  Field2 DataType2,
  ...
);

DECLARE @ImportRows INT;
DECLARE @UpdatedIDs TABLE ([IDField] INT NOT NULL);

BEGIN TRY

  INSERT INTO #TempImport (Field1, Field2, ...)
    SELECT tab.col.value('XQueryForField1', 'DataType') AS [Field1],
           tab.col.value('XQueryForField2', 'DataType') AS [Field2],
           ...
    FROM   @XmlInputParam.nodes('XQuery') tab(col);

  SET @ImportRows = @@ROWCOUNT;

  IF (@ImportRows = 0)
  BEGIN
    RAISERROR('Seriously?', 16, 1); -- no rows to import
  END;

  -- optional: test to see if it helps or hurts
  -- ALTER TABLE #TempImport
  --   ADD CONSTRAINT [PK_#TempImport]
  --   PRIMARY KEY CLUSTERED (PKField ASC)
  --   WITH FILLFACTOR = 100;


  -- optional: remove any records that should not be synced
  DELETE tmp
  FROM   #TempImport tmp
  INNER JOIN [SchemaName].[TableName] tab
          ON tab.IDField = tmp.IDField
          ... -- more fields if PK or alternate key is composite
  WHERE  tmp.ModifiedDate < tab.ModifiedDate;

  BEGIN TRAN;

  UPDATE tab
  SET    tab.Field1 = tmp.Field1,
         tab.Field2 = tmp.Field2,
         ...
  OUTPUT INSERTED.IDField
  INTO   @UpdatedIDs ([IDField]) -- capture IDs that are updated
  FROM   [SchemaName].[TableName] tab
  INNER JOIN #TempImport tmp
          ON tmp.IDField = tab.IDField
          ... -- more fields if PK or alternate key is composite

  IF (@@ROWCOUNT < @ImportRows) -- if all rows were updates then skip, else insert remaining
  BEGIN
    -- get rid of rows that were updates, leaving only the ones to insert
    DELETE tmp
    FROM   #TempImport tmp
    INNER JOIN @UpdatedIDs del
            ON del.[IDField] = tmp.[IDField];

    -- OR, rather than the DELETE, maybe add a column to #TempImport for:
    -- [IsUpdate] BIT NOT NULL DEFAULT (0)
    -- Then UPDATE #TempImport SET [IsUpdate] = 1 JOIN @UpdatedIDs ON [IDField]
    -- Then, in below INSERT, add:  WHERE [IsUpdate] = 0

    INSERT INTO [SchemaName].[TableName]
      (Field1, Field2, ...)
      SELECT tmp.Field1, tmp.Field2, ...
      FROM   #TempImport tmp
  END;

  COMMIT TRAN;

END TRY
BEGIN CATCH
  IF (@@TRANCOUNT > 0)
  BEGIN
    ROLLBACK;
  END;

  -- THROW; -- if using SQL 2012 or newer, use this and remove the following 3 lines
  DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();
  RAISERROR(@ErrorMessage, 16, 1);
  RETURN;
END CATCH;

I have used this model several times on imports / ETLs that either have well over 1000 rows or process maybe 500 per batch out of a total set of 20k to over a million rows. However, I have not tested the performance difference between the DELETE of the updated rows out of the temp table vs just updating the [IsUpdate] field.

Please note regarding the decision to use XML over TVP due to there being, at most, 1000 rows to import at a time (mentioned in the question):

If this is being called a few times here and there, then quite possibly the minor performance gain in TVP might not be worth the additional maintenance cost (needing to drop the proc before changing the User-Defined Table Type, app code changes, etc). But if you are importing 4 million rows, sending 1000 at a time, that is 4000 executions (and 4 million rows of XML to parse no matter how it is broken up), and even a minor performance difference when executed only a few times will add up to a noticeable difference.

That being said, the method as I have described does not change outside of replacing the SELECT FROM @XmlInputParam to be SELECT FROM @TVP. Since TVPs are read-only, you wouldn't be able to delete from them. I guess you could simply add a WHERE NOT EXISTS(SELECT * FROM @UpdatedIDs ids WHERE ids.IDField = tmp.IDField) to that final SELECT (tied to the INSERT) instead of the simple WHERE IsUpdate = 0. If you were to use the @UpdatedIDs table variable in this manner, then you could even get away with not dumping the incoming rows into the temp table.
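
A rough sketch of that last variation, with assumptions called out: @TVP here is an ordinary table variable standing in for a read-only table-valued parameter (a real TVP needs a user-defined table type and a procedure), and #Target with its columns and sample rows is hypothetical.

```sql
-- Hypothetical target table with one existing row.
CREATE TABLE #Target (IDField INT NOT NULL PRIMARY KEY, Field1 NVARCHAR(50) NULL);
INSERT INTO #Target (IDField, Field1) VALUES (1, N'old');

-- Stand-in for the read-only TVP: one row to update, one row to insert.
DECLARE @TVP TABLE (IDField INT NOT NULL PRIMARY KEY, Field1 NVARCHAR(50) NULL);
INSERT INTO @TVP (IDField, Field1) VALUES (1, N'new'), (2, N'added');

DECLARE @UpdatedIDs TABLE (IDField INT NOT NULL PRIMARY KEY);

BEGIN TRAN;

-- Update existing rows and remember which IDs were touched.
UPDATE tab
SET    tab.Field1 = tmp.Field1
OUTPUT INSERTED.IDField
INTO   @UpdatedIDs (IDField)
FROM   #Target tab
INNER JOIN @TVP tmp
        ON tmp.IDField = tab.IDField;

-- Insert the rest, filtering on the captured IDs instead of deleting from the TVP.
INSERT INTO #Target (IDField, Field1)
  SELECT tmp.IDField, tmp.Field1
  FROM   @TVP tmp
  WHERE  NOT EXISTS (
                       SELECT *
                       FROM   @UpdatedIDs ids
                       WHERE  ids.IDField = tmp.IDField
                     );

COMMIT TRAN;

SELECT IDField, Field1 FROM #Target ORDER BY IDField;

DROP TABLE #Target;
```

With this data, row 1 is updated to 'new' and row 2 is inserted as 'added'.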
