PostgreSQL UPSERT vs MERGE – Key Differences Explained

mergepostgresqlterminologyupsert

From the PostgreSQL wiki,

MERGE is typically used to merge two tables, and was introduced in the 2003 SQL standard. The REPLACE statement (a MySQL extension) or UPSERT sequence attempts an UPDATE, or on failure, INSERT. This is similar to UPDATE, then for unmatched rows, INSERT. Whether concurrent access allows modifications which could cause row loss is implementation independent.

Further PostgreSQL's INSERT ... ON CONFLICT DO NOTHING/UPDATE is marketed as UPSERT and was added in 9.5

What then is MERGE? And how does it fit into the mix?

Best Answer

Generally,

UPSERT is built off of INSERT
MERGE focuses on merging/synchronizing tables and provides
- conditionality (where clauses)
- DELETE support.

UPSERT lacks conditionality, In PostgreSQL, you had some ability to specify conditions, by proxy of the index that was being violated, for instance

ON CONFLICT ON CONSTRAINT countries_pkey DO NOTHING;
ON CONFLICT (country) DO NOTHING;

But it didn't provide the ability to specify multiple conditions nor the ability to DELETE in any condition, both of which permit a more rich set of rules making it possible to "synchronize tables with minimal work" which seems to be the goal of MERGE.

As another matter, from Peter Geoghegan's post highlighting some of the differences "SQL MERGE is quite distinct from UPSERT"

UPSERT is atomic in its insert-or-update
SQL MERGE doesn't technically require a UNIQUE INDEX.

DECLARE @Source TABLE(
    sl_no       INT
    ,name       NVARCHAR(30)
    ,dob        DATETIME2(3)
    ,city       NVARCHAR(30)
    ,occupation NVARCHAR(30)
);

INSERT INTO @Source
VALUES
    (1, 'galileo-galilei', '1900-01-01 00:00:00.000', NULL, NULL),
    (2, 'galileo-galilei', '1900-01-02 00:00:00.000', 'venice', NULL),
    (3, 'galileo-galilei', '1900-01-05 00:00:00.000', NULL, 'astronomer'),
    (4, 'issac-newton',    '1900-01-01 00:00:00.000', 'london', 'mathematician-scientist')

SELECT DISTINCT
    name
    ,MAX(dob)           OVER(PARTITION BY name) AS dob
    ,MAX(city)          OVER(PARTITION BY name) AS city
    ,MAX(occupation)    OVER(PARTITION BY name) AS occupation
FROM 
    @Source

However, I suspect the reality is you could have multiple records and you always want to return the value from the most recent record that has data in those columns. E.g. if your source was:

DECLARE @Source TABLE(
    sl_no       INT
    ,name       NVARCHAR(30)
    ,dob        DATETIME2(3)
    ,city       NVARCHAR(30)
    ,occupation NVARCHAR(30)
);

INSERT INTO @Source
VALUES
    (1, 'galileo-galilei', '1900-01-01 00:00:00.000', 'rome', NULL),
    (2, 'galileo-galilei', '1900-01-02 00:00:00.000', 'venice', NULL),
    (3, 'galileo-galilei', '1900-01-05 00:00:00.000', NULL, 'astronomer'),
    (4, 'issac-newton',    '1900-01-01 00:00:00.000', 'london', 'mathematician-scientist')

You could achieve what you want with:

SELECT
    s.name
    ,s.dob
    ,sc.city
    ,so.occupation
FROM
    @Source AS s
    CROSS APPLY(
        SELECT TOP 1 city
        FROM @Source AS s2
        WHERE s2.name = s.name
        AND city IS NOT NULL
        ORDER BY sl_no DESC
        ) AS sc
    CROSS APPLY(
        SELECT TOP 1 occupation
        FROM @Source AS s3
        WHERE s3.name = s.name
        AND occupation IS NOT NULL
        ORDER BY sl_no DESC
    ) AS so
WHERE
    s.sl_no = (SELECT MAX(sl_no) FROM @Source AS s4 WHERE s4.name = s.name)

Wrap that up into a merge or update (I'll do a merge for you) and you will get:

WITH src AS (
    SELECT
        s.name
        ,s.dob
        ,sc.city
        ,so.occupation
    FROM
        @Source AS s
        CROSS APPLY(
            SELECT TOP 1 city
            FROM @Source AS s2
            WHERE s2.name = s.name
            AND city IS NOT NULL
            ORDER BY sl_no DESC
            ) AS sc
        CROSS APPLY(
            SELECT TOP 1 occupation
            FROM @Source AS s3
            WHERE s3.name = s.name
            AND occupation IS NOT NULL
            ORDER BY sl_no DESC
        ) AS so
    WHERE
        s.sl_no = (SELECT MAX(sl_no) FROM @Source AS s4 WHERE s4.name = s.name)
)
MERGE INTO Destination AS tgt
USING tgt.name = src.name

WHEN MATCHED THEN UPDATE
SET dob = src.dob
    ,city = src.city
    ,occupation = src.occupation

WHEN NOT MATCHED THEN INSERT(name, dob, city, occupation)
VALUES(src.name, src.dob, src.city, src.occuptaion);

You are going to want to index the column you join on all the time (name in the above examples) for performance. Otherwise you will get many scans.

PostgreSQL UPSERT vs MERGE – Key Differences Explained

Best Answer

See also

Related Question

Best Answer

See also

Related Solutions

Setting Values Conditionally in PostgreSQL UPSERT

SQL Server – Merge and Update Multiple Rows Across Columns

Related Question