SQL Server – How to Merge/Intersect Between Table and Dataset

insertmergesql server

Consider the following table:

Id          Hash
----------- ----------------------------------
1           0x31F777F0804D301936411E3ECD760859
2           0xD64A593F3E9ACC972158D522A4289EA0

(Id is an identity column)

In to that table I want to merge the following dataset:

Hash
----------------------------------
0x31F777F0804D301936411E3ECD760859
0x31F777F0804D301936411E3ECD760859
0x0C5A65264F92A543E7AAA06375349C06

(Id is NOT present in the dataset)

The rules for the merge are:

  • If the hash does not exist in the table, insert it to the table;
  • If the hash does not exist in the dataset, delete it from the table;
  • If the hash does exist on both sides, and there are X instances in the table and Y instances in the source, then (Y-X) instances should be inserted in to the table.

The result of the merge should leave the table looking like this:

Id          Hash
----------- ----------------------------------
1           0x31F777F0804D301936411E3ECD760859
3           0x31F777F0804D301936411E3ECD760859
4           0x0C5A65264F92A543E7AAA06375349C06

What is the most efficient way to write a query to achieve this operation? FYI, other columns are omitted for brevity.

Best Answer

Using the sample data:

DECLARE @T table
(
    Id integer IDENTITY NOT NULL PRIMARY KEY, 
    [Hash] binary(16) NOT NULL INDEX h
);

INSERT @T ([Hash]) VALUES (0x31F777F0804D301936411E3ECD760859);
INSERT @T ([Hash]) VALUES (0xD64A593F3E9ACC972158D522A4289EA0);

DECLARE @S table 
(
    [Hash] binary(16) NOT NULL
);

INSERT @S
    ([Hash])
VALUES
    (0x31F777F0804D301936411E3ECD760859),
    (0x31F777F0804D301936411E3ECD760859),
    (0x0C5A65264F92A543E7AAA06375349C06);

You could write it as a MERGE:

WITH
    T AS
    (
        SELECT
            T.[Hash], 
            rn = ROW_NUMBER() OVER (
                PARTITION BY T.[Hash] 
                ORDER BY T.[Hash], T.Id)
        FROM @T AS T
    ),
    S AS
    (
        SELECT DISTINCT
            S.[Hash],
            rn = ROW_NUMBER() OVER (
                PARTITION BY S.[Hash] 
                ORDER BY S.[Hash])
        FROM @S AS S
    )
MERGE T
USING S
    ON S.[Hash] = T.[Hash]
    AND S.rn = T.rn
WHEN NOT MATCHED BY TARGET THEN INSERT ([Hash]) VALUES (S.[Hash])
WHEN NOT MATCHED BY SOURCE THEN DELETE;

db<>fiddle

But for performance reasons (as well as some bugs), I would normally write it as two separate statements:

WITH ToDelete AS
(
    SELECT
        T.*
    FROM @T AS T
    WHERE 
        NOT EXISTS 
        (
            SELECT
                S.* 
            FROM @S AS S 
            WHERE 
                S.[Hash] = T.[Hash]
        )
)
DELETE ToDelete;
WITH ToInsert AS
(
    SELECT
        S.[Hash], 
        rn = ROW_NUMBER() OVER (
            PARTITION BY S.[Hash] 
            ORDER BY S.[Hash])
    FROM @S AS S
    EXCEPT
    SELECT
        T.[Hash], 
        rn = ROW_NUMBER() OVER (
            PARTITION BY T.[Hash] 
            ORDER BY T.[Hash], T.Id)
    FROM @T AS T
)
INSERT @T
    ([Hash])
SELECT
    ToInsert.[Hash]
FROM ToInsert;

db<>fiddle

You should have a unique index on the target on ([Hash], [Id]). Chances are you have this already, or equivalently an index on [Hash] and a unique (possibly clustered) index on [Id].

There may be other issues depending on what else was omitted from the question for brevity. In any case it should provide a couple of possible starting points for your own solution.