Whether the source is XML or a TVP does not make a huge difference. The overall operation is essentially:
- UPDATE existing rows
- INSERT missing rows
You do it in that order because if you INSERT first, all of the rows will then exist to receive the UPDATE, and you will do repeated work on the rows that were just inserted.
Beyond that there are different ways to accomplish this and various ways to tweak some additional efficiency out of it.
Let's start with the bare minimum. Since extracting the XML is likely to be one of the more expensive parts of this operation (if not the most expensive), we don't want to have to do that twice (as we have two operations to perform). So, we create a temp table and extract the data out of the XML into it:
CREATE TABLE #TempImport
(
  Field1 DataType1,
  Field2 DataType2,
  ...
);

INSERT INTO #TempImport (Field1, Field2, ...)
  SELECT tab.col.value('XQueryForField1', 'DataType') AS [Field1],
         tab.col.value('XQueryForField2', 'DataType') AS [Field2],
         ...
  FROM   @XmlInputParam.nodes('XQuery') tab(col);
From there we do the UPDATE and then the INSERT:
UPDATE tab
SET    tab.Field1 = tmp.Field1,
       tab.Field2 = tmp.Field2,
       ...
FROM   [SchemaName].[TableName] tab
       INNER JOIN #TempImport tmp
               ON tmp.IDField = tab.IDField
               ... -- more fields if PK or alternate key is composite

INSERT INTO [SchemaName].[TableName] (Field1, Field2, ...)
  SELECT tmp.Field1, tmp.Field2, ...
  FROM   #TempImport tmp
  WHERE  NOT EXISTS (
           SELECT *
           FROM   [SchemaName].[TableName] tab
           WHERE  tab.IDField = tmp.IDField
           ... -- more fields if PK or alternate key is composite
         );
Now that we have the basic operation down, we can do a few things to optimize:
- Capture the @@ROWCOUNT of the INSERT into the temp table and compare it to the @@ROWCOUNT of the UPDATE; if they are the same, then we can skip the INSERT.
- Capture the ID values that were updated via the OUTPUT clause and DELETE those from the temp table; then the INSERT doesn't need the WHERE NOT EXISTS(...).
- If there are any rows in the incoming data that should not be synced (i.e. neither inserted nor updated), then those records should be removed prior to doing the UPDATE.
Putting those optimizations together:
CREATE TABLE #TempImport
(
  Field1 DataType1,
  Field2 DataType2,
  ...
);

DECLARE @ImportRows INT;
DECLARE @UpdatedIDs TABLE ([IDField] INT NOT NULL);

BEGIN TRY

  INSERT INTO #TempImport (Field1, Field2, ...)
    SELECT tab.col.value('XQueryForField1', 'DataType') AS [Field1],
           tab.col.value('XQueryForField2', 'DataType') AS [Field2],
           ...
    FROM   @XmlInputParam.nodes('XQuery') tab(col);

  SET @ImportRows = @@ROWCOUNT;

  IF (@ImportRows = 0)
  BEGIN
    RAISERROR('Seriously?', 16, 1); -- no rows to import
  END;

  -- optional: test to see if it helps or hurts
  -- ALTER TABLE #TempImport
  --   ADD CONSTRAINT [PK_#TempImport]
  --   PRIMARY KEY CLUSTERED (PKField ASC)
  --   WITH FILLFACTOR = 100;

  -- optional: remove any records that should not be synced
  DELETE tmp
  FROM   #TempImport tmp
         INNER JOIN [SchemaName].[TableName] tab
                 ON tab.IDField = tmp.IDField
                 ... -- more fields if PK or alternate key is composite
  WHERE  tmp.ModifiedDate < tab.ModifiedDate;

  BEGIN TRAN;

  UPDATE tab
  SET    tab.Field1 = tmp.Field1,
         tab.Field2 = tmp.Field2,
         ...
  OUTPUT INSERTED.IDField
  INTO   @UpdatedIDs ([IDField]) -- capture IDs that are updated
  FROM   [SchemaName].[TableName] tab
         INNER JOIN #TempImport tmp
                 ON tmp.IDField = tab.IDField
                 ... -- more fields if PK or alternate key is composite

  IF (@@ROWCOUNT < @ImportRows) -- if all rows were updates then skip, else insert remaining
  BEGIN
    -- get rid of rows that were updates, leaving only the ones to insert
    DELETE tmp
    FROM   #TempImport tmp
           INNER JOIN @UpdatedIDs del
                   ON del.[IDField] = tmp.[IDField];

    -- OR, rather than the DELETE, maybe add a column to #TempImport for:
    -- [IsUpdate] BIT NOT NULL DEFAULT (0)
    -- Then UPDATE #TempImport SET [IsUpdate] = 1 JOIN @UpdatedIDs ON [IDField]
    -- Then, in below INSERT, add: WHERE [IsUpdate] = 0

    INSERT INTO [SchemaName].[TableName] (Field1, Field2, ...)
      SELECT tmp.Field1, tmp.Field2, ...
      FROM   #TempImport tmp;
  END;

  COMMIT TRAN;

END TRY
BEGIN CATCH
  IF (@@TRANCOUNT > 0)
  BEGIN
    ROLLBACK;
  END;

  -- THROW; -- if using SQL 2012 or newer, use this and remove the following 3 lines
  DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();
  RAISERROR(@ErrorMessage, 16, 1);
  RETURN;
END CATCH;
I have used this model several times on imports / ETLs that either have well over 1000 rows, or batches of maybe 500 out of a total set ranging from 20k to over a million rows. However, I have not tested the performance difference between the DELETE of the updated rows out of the temp table and just updating the [IsUpdate] field.
Please note regarding the decision to use XML over TVP due to there being, at most, 1000 rows to import at a time (mentioned in the question):
If this is being called only a few times here and there, then quite possibly the minor performance gain of a TVP is not worth the additional maintenance cost (needing to drop the proc before changing the User-Defined Table Type, app code changes, etc.). But if you are importing 4 million rows, sending 1000 at a time, that is 4000 executions (and 4 million rows of XML to parse no matter how it is broken up), and even a performance difference that is negligible across a few executions will add up to a noticeable difference.
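For context, a rough sketch of what the TVP route would involve on the SQL Server side (the type and procedure names here are assumptions, not anything from the question): a User-Defined Table Type plus a READONLY parameter on the stored procedure.
CREATE TYPE dbo.ImportList AS TABLE
(
  IDField INT NOT NULL,
  Field1  DataType1,
  Field2  DataType2,
  ...
);
GO
CREATE PROCEDURE dbo.ImportData
(
  @TVP dbo.ImportList READONLY -- TVP parameters must be declared READONLY
)
AS
...
Changing the columns of dbo.ImportList later requires dropping and recreating anything that references it, which is the maintenance cost mentioned above.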
That being said, the method I have described does not change beyond replacing the SELECT FROM @XmlInputParam with SELECT FROM @TVP. Since TVPs are read-only, you wouldn't be able to delete from them. In that case you could simply add WHERE NOT EXISTS (SELECT * FROM @UpdatedIDs ids WHERE ids.IDField = tmp.IDField) to that final SELECT (tied to the INSERT) instead of the simple WHERE IsUpdate = 0. If you use the @UpdatedIDs table variable in this manner, you could even get away with not dumping the incoming rows into a temp table at all.
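To make that concrete, here is a hedged sketch of how the final INSERT could look when reading straight from the TVP (the @TVP parameter name and single-column key are assumptions carried over from the pattern above):
INSERT INTO [SchemaName].[TableName] (Field1, Field2, ...)
  SELECT tvp.Field1, tvp.Field2, ...
  FROM   @TVP tvp
  WHERE  NOT EXISTS (SELECT *
                     FROM   @UpdatedIDs ids
                     WHERE  ids.[IDField] = tvp.[IDField]);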
Clarify ON CONFLICT DO UPDATE behavior
Consider the manual:
For each individual row proposed for insertion, either the insertion proceeds, or, **if an arbiter constraint or index specified by conflict_target is violated, the alternative conflict_action is taken**.
Bold emphasis mine. So you do not have to repeat predicates for columns included in the unique index in the WHERE clause to the UPDATE (the conflict_action):
INSERT INTO test_upsert AS tu
       (name   , status, test_field  , identifier, count)
VALUES ('shaun', 1     , 'test value', 'ident'   , 1)
ON CONFLICT (name, status, test_field) DO UPDATE
SET    count = tu.count + 1;
-- WHERE tu.name = 'shaun' AND tu.status = 1 AND tu.test_field = 'test value'  -- redundant
The unique violation already establishes what your added WHERE
clause would enforce redundantly.
Clarify partial index
Add a WHERE clause to make it an actual partial index, like you mentioned yourself (but with inverted logic):
CREATE UNIQUE INDEX test_upsert_partial_idx
ON public.test_upsert (name, status)
WHERE test_field IS NULL; -- not: "is not null"
To use this partial index in your UPSERT you need a matching conflict_target, like @ypercube demonstrates:
ON CONFLICT (name, status) WHERE test_field IS NULL
Now the above partial index is inferred. However, as the manual also notes:
[...] a non-partial unique index (a unique index without a predicate) will be inferred (and thus used by ON CONFLICT) if such an index satisfying every other criteria is available.
If you have an additional (or only) index on just (name, status), it will (also) be used. An index on (name, status, test_field) would explicitly not be inferred. This doesn't explain your problem, but may have added to the confusion while testing.
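For illustration, a complete statement that uses the partial index as arbiter might look like this (a sketch reusing the test_upsert columns from the question; the inserted row must actually have test_field IS NULL for the partial index to apply):
INSERT INTO test_upsert AS tu (name, status, test_field, identifier, count)
VALUES ('shaun', 1, NULL, 'ident', 1)
ON CONFLICT (name, status) WHERE test_field IS NULL DO UPDATE
SET count = tu.count + 1;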
Solution
As I understand it, none of the above solves your problem yet. With the partial index, only special cases with matching NULL values would be caught. Other duplicate rows would either be inserted (if you have no other matching unique indexes / constraints) or raise an exception (if you do). I suppose that's not what you want. You write:
The composite key is made up of 20 columns, 10 of which can be nullable.
What exactly do you consider a duplicate? Postgres (according to the SQL standard) does not consider two NULL values to be equal. The manual:
In general, a unique constraint is violated if there is more than one
row in the table where the values of all of the columns included in
the constraint are equal. However, two null values are never
considered equal in this comparison. That means even in the presence
of a unique constraint it is possible to store duplicate rows that
contain a null value in at least one of the constrained columns. This
behavior conforms to the SQL standard, but we have heard that other
SQL databases might not follow this rule. So be careful when
developing applications that are intended to be portable.
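To see that behavior in isolation (a minimal demo with a throwaway table, not the table from the question):
CREATE TABLE null_demo (name text, status int, test_field text,
                        UNIQUE (name, status, test_field));
INSERT INTO null_demo VALUES ('shaun', 1, NULL);
INSERT INTO null_demo VALUES ('shaun', 1, NULL); -- accepted: the two NULLs are not considered equal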
I assume you want NULL values in all 10 nullable columns to be considered equal. It is elegant & practical to cover a single nullable column with an additional partial index.
But this gets out of hand quickly for more nullable columns. You'd need a partial index for every distinct combination of nullable columns. For just 2 of those, that's 3 partial indexes, for (a), (b) and (a, b). The number grows exponentially with 2^n - 1. For your 10 nullable columns, to cover all possible combinations of NULL values, you'd already need 1023 partial indexes. No go.
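Just to illustrate the combinatorial blow-up for the two-column case (a hypothetical table t with one NOT NULL key column k and two nullable key columns a and b):
CREATE UNIQUE INDEX t_uni_idx     ON t (k, a, b);                             -- rows without NULLs
CREATE UNIQUE INDEX t_a_null_idx  ON t (k, b) WHERE a IS NULL;                -- (a)
CREATE UNIQUE INDEX t_b_null_idx  ON t (k, a) WHERE b IS NULL;                -- (b)
CREATE UNIQUE INDEX t_ab_null_idx ON t (k)    WHERE a IS NULL AND b IS NULL;  -- (a, b)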
The simple solution: replace NULL values and define the involved columns NOT NULL, and everything would work just fine with a simple UNIQUE constraint.
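A possible way to get there for the example column (the UPDATE backfill and the constraint name are assumptions):
ALTER TABLE test_upsert ALTER COLUMN test_field SET DEFAULT '';
UPDATE test_upsert SET test_field = '' WHERE test_field IS NULL;
ALTER TABLE test_upsert ALTER COLUMN test_field SET NOT NULL;
ALTER TABLE test_upsert ADD CONSTRAINT test_upsert_uni UNIQUE (name, status, test_field);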
If that's not an option, I suggest an expression index with COALESCE to replace NULL in the index:
CREATE UNIQUE INDEX test_upsert_solution_idx
ON test_upsert (name, status, COALESCE(test_field, ''));
The empty string ('') is an obvious candidate for character types, but you can use any legal value that either never appears or can be folded with NULL according to your definition of "unique".
Then use this statement:
INSERT INTO test_upsert as tu(name,status,test_field,identifier, count)
VALUES ('shaun', 1, null , 'ident', 11) -- works with
, ('bob' , 2, 'test value', 'ident', 22) -- and without NULL
ON CONFLICT (name, status, COALESCE(test_field, '')) DO UPDATE -- match expr. index
SET count = COALESCE(tu.count + EXCLUDED.count, EXCLUDED.count, tu.count);
Like @ypercube, I assume you actually want to add count to the existing count. Since the column can be NULL, adding NULL would set the column to NULL. If you define count NOT NULL, you can simplify.
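A sketch of the simplified form, assuming count is defined NOT NULL (so neither operand can be NULL):
INSERT INTO test_upsert AS tu (name, status, test_field, identifier, count)
VALUES ('shaun', 1, 'test value', 'ident', 1)
ON CONFLICT (name, status, COALESCE(test_field, '')) DO UPDATE
SET count = tu.count + EXCLUDED.count; -- no COALESCE needed when count is NOT NULL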
Another idea would be to just drop the conflict_target from the statement to cover all unique violations. Then you could define various unique indexes for a more sophisticated definition of what's supposed to be "unique". But that won't fly with ON CONFLICT DO UPDATE. The manual once more:
For ON CONFLICT DO NOTHING, it is optional to specify a conflict_target; when omitted, conflicts with all usable constraints (and unique indexes) are handled. For ON CONFLICT DO UPDATE, a conflict_target must be provided.
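So a statement without a conflict_target is only possible with DO NOTHING, along these lines (a sketch using the same example values):
INSERT INTO test_upsert (name, status, test_field, identifier, count)
VALUES ('shaun', 1, 'test value', 'ident', 1)
ON CONFLICT DO NOTHING; -- no conflict_target: any usable unique constraint / index can arbitrate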
Best Answer
In PostgreSQL we have a name for it: INSERT ... RETURNING. You can use this form in a CTE to obtain the effect you want. Here we:
- define t1, which holds the test data we plan to INSERT or SELECT on;
- in t2, INSERT the data we don't already have into our desired table foo, and return the results (that which was inserted) as t2;
- SELECT to get all of the elements in our test data (t1) that already exist;
- INSERT all of the others.
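A minimal sketch of that pattern, assuming a hypothetical target table foo with a single name key column and literal test data standing in for t1:
-- assumed for illustration: CREATE TABLE foo (name text PRIMARY KEY);
WITH t1 AS (            -- the test data we plan to INSERT or SELECT on
    SELECT * FROM (VALUES ('a'), ('b'), ('c')) AS v(name)
), t2 AS (              -- INSERT what foo doesn't already have; RETURNING hands it back
    INSERT INTO foo (name)
    SELECT name FROM t1
    WHERE NOT EXISTS (SELECT 1 FROM foo WHERE foo.name = t1.name)
    RETURNING *
)
SELECT foo.*            -- the rows from t1 that already existed in foo ...
FROM   foo
JOIN   t1 USING (name)
UNION ALL
SELECT * FROM t2;       -- ... plus the rows t2 just inserted
Note that the outer SELECT does not see the rows inserted inside t2 (they share the same snapshot), which is exactly why the UNION ALL with t2 is needed.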