GUIDs only really cause performance problems for a couple of reasons.
First, their size. You are using a 16-byte identifier where an INT (4 bytes) or even a BIGINT (8 bytes) is much smaller. On a wide table, or a short one, this doesn't matter so much; if your table is fairly narrow, the percentage increase is of course greater. A 2000-byte row width vs. a 2012-byte width is really not that big a deal, whereas 20 vs. 32 bytes is obviously a larger relative difference. And if your table is going to be large, millions or even hundreds of millions of rows, then the 8-12 bytes you are adding per row will make a far more obvious difference than in a table with only 1000 rows (an extra 12 bytes per row across 100 million rows, for example, is over 1 GB of additional storage before counting indexes). On the other hand, if you need the portability a GUID gives you, then you may be willing to accept the space cost.
Next are the issues that only occur if the GUID is part of the clustered index. If the GUID is the clustered index key, then it will be included in the leaf-level pages of all non-clustered indexes. By comparison, a table with an integer clustered index adds only 4 bytes to each leaf-level row where the GUID adds 16. This can add up over time, particularly with multiple indexes, and it also increases the number of reads required to use those indexes, since fewer entries fit per page. The other problem GUIDs have as a clustered index is that they are not typically generated in order, which means potential page splits during inserts. (The link you provided mentions that this isn't a huge increase over the write times in Azure, however.) Typically you could use NEWSEQUENTIALID() to generate the values, as it returns a value greater than any previous value generated by NEWSEQUENTIALID() on that computer since Windows was started; however, it is not supported by Azure.
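For illustration, here is a minimal sketch of the typical NEWSEQUENTIALID() setup (the table and column names are made up, not from the question); note that it can only be referenced in a DEFAULT constraint:

CREATE TABLE dbo.Widgets
(
    WidgetID   UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT [DF_Widgets_WidgetID] DEFAULT (NEWSEQUENTIALID()),
    WidgetName NVARCHAR(50)     NOT NULL,
    CONSTRAINT [PK_Widgets] PRIMARY KEY CLUSTERED (WidgetID)
);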
In my opinion, if you are already going to have a CreatedAtUTC column, then go ahead and make it the clustered index and use the GUID as your (non-clustered) primary key. Also, here is a link I found to a blog by the Azure team on the subject; of course, he self-professes to love GUIDs, so he is somewhat biased. Another benefit of using CreatedAtUTC as the clustered index is that it doesn't change, which means minimal row movement.
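A minimal sketch of that layout, again with illustrative names: the GUID remains the primary key but is non-clustered, while CreatedAtUTC carries the clustered index. Since CreatedAtUTC alone is not unique, SQL Server will add a hidden 4-byte uniquifier to duplicate key values.

CREATE TABLE dbo.Orders
(
    OrderID      UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT [DF_Orders_OrderID] DEFAULT (NEWID()),
    CreatedAtUTC DATETIME2(3)     NOT NULL
        CONSTRAINT [DF_Orders_CreatedAtUTC] DEFAULT (SYSUTCDATETIME()),
    OrderTotal   DECIMAL(18, 2)   NOT NULL,
    CONSTRAINT [PK_Orders] PRIMARY KEY NONCLUSTERED (OrderID)
);

CREATE CLUSTERED INDEX [CIX_Orders_CreatedAtUTC]
    ON dbo.Orders (CreatedAtUTC);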
Whether the source is XML or a TVP does not make a huge difference. The overall operation is essentially:
- UPDATE existing rows
- INSERT missing rows
You do it in that order because if you INSERT first, then all rows will exist when the UPDATE runs, and you will do repeated work on any rows that were just inserted.
Beyond that there are different ways to accomplish this and various ways to tweak some additional efficiency out of it.
Let's start with the bare minimum. Since extracting the XML is likely to be one of the more expensive parts of this operation (if not the most expensive), we don't want to have to do that twice (as we have two operations to perform). So, we create a temp table and extract the data out of the XML into it:
CREATE TABLE #TempImport
(
    Field1 DataType1,
    Field2 DataType2,
    ...
);

INSERT INTO #TempImport (Field1, Field2, ...)
    SELECT tab.col.value('XQueryForField1', 'DataType') AS [Field1],
           tab.col.value('XQueryForField2', 'DataType') AS [Field2],
           ...
    FROM   @XmlInputParam.nodes('XQuery') tab(col);
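To make the template concrete, here is a self-contained example; the XML structure and column names are made up for illustration, only the .nodes() / .value() pattern itself is from the template above:

DECLARE @XmlInputParam XML = N'
<Customers>
  <Customer ID="1" Name="Bob" />
  <Customer ID="2" Name="Sue" />
</Customers>';

CREATE TABLE #TempImport
(
    IDField      INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL
);

INSERT INTO #TempImport (IDField, CustomerName)
    SELECT tab.col.value('@ID',   'INT')           AS [IDField],
           tab.col.value('@Name', 'NVARCHAR(100)') AS [CustomerName]
    FROM   @XmlInputParam.nodes('/Customers/Customer') tab(col);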
From there we do the UPDATE and then the INSERT:
UPDATE tab
SET    tab.Field1 = tmp.Field1,
       tab.Field2 = tmp.Field2,
       ...
FROM   [SchemaName].[TableName] tab
INNER JOIN #TempImport tmp
        ON tmp.IDField = tab.IDField
        ... -- more fields if PK or alternate key is composite

INSERT INTO [SchemaName].[TableName]
    (Field1, Field2, ...)
    SELECT tmp.Field1, tmp.Field2, ...
    FROM   #TempImport tmp
    WHERE  NOT EXISTS (SELECT *
                       FROM   [SchemaName].[TableName] tab
                       WHERE  tab.IDField = tmp.IDField
                       ... -- more fields if PK or alternate key is composite
                      );
Now that we have the basic operation down, we can do a few things to optimize:
- Capture the @@ROWCOUNT of the INSERT into the temp table and compare it to the @@ROWCOUNT of the UPDATE. If they are the same, then we can skip the INSERT entirely.
- Capture the ID values that were updated via the OUTPUT clause and DELETE those from the temp table. Then the INSERT no longer needs the WHERE NOT EXISTS(...).
- If there are any rows in the incoming data that should not be synced (i.e. neither inserted nor updated), then those records should be removed prior to doing the UPDATE.
CREATE TABLE #TempImport
(
    Field1 DataType1,
    Field2 DataType2,
    ...
);

DECLARE @ImportRows INT;
DECLARE @UpdatedIDs TABLE ([IDField] INT NOT NULL);

BEGIN TRY
    INSERT INTO #TempImport (Field1, Field2, ...)
        SELECT tab.col.value('XQueryForField1', 'DataType') AS [Field1],
               tab.col.value('XQueryForField2', 'DataType') AS [Field2],
               ...
        FROM   @XmlInputParam.nodes('XQuery') tab(col);

    SET @ImportRows = @@ROWCOUNT;

    IF (@ImportRows = 0)
    BEGIN
        RAISERROR('Seriously?', 16, 1); -- no rows to import
    END;

    -- optional: test to see if it helps or hurts
    -- ALTER TABLE #TempImport
    --     ADD CONSTRAINT [PK_#TempImport]
    --     PRIMARY KEY CLUSTERED (PKField ASC)
    --     WITH FILLFACTOR = 100;

    -- optional: remove any records that should not be synced
    DELETE tmp
    FROM   #TempImport tmp
    INNER JOIN [SchemaName].[TableName] tab
            ON tab.IDField = tmp.IDField
            ... -- more fields if PK or alternate key is composite
    WHERE  tmp.ModifiedDate < tab.ModifiedDate;

    BEGIN TRAN;

    UPDATE tab
    SET    tab.Field1 = tmp.Field1,
           tab.Field2 = tmp.Field2,
           ...
    OUTPUT INSERTED.IDField
    INTO   @UpdatedIDs ([IDField]) -- capture IDs that are updated
    FROM   [SchemaName].[TableName] tab
    INNER JOIN #TempImport tmp
            ON tmp.IDField = tab.IDField
            ... -- more fields if PK or alternate key is composite

    IF (@@ROWCOUNT < @ImportRows) -- if all rows were updates then skip, else insert remaining
    BEGIN
        -- get rid of rows that were updates, leaving only the ones to insert
        DELETE tmp
        FROM   #TempImport tmp
        INNER JOIN @UpdatedIDs del
                ON del.[IDField] = tmp.[IDField];

        -- OR, rather than the DELETE, maybe add a column to #TempImport for:
        -- [IsUpdate] BIT NOT NULL DEFAULT (0)
        -- Then UPDATE #TempImport SET [IsUpdate] = 1 JOIN @UpdatedIDs ON [IDField]
        -- Then, in the below INSERT, add: WHERE [IsUpdate] = 0

        INSERT INTO [SchemaName].[TableName]
            (Field1, Field2, ...)
            SELECT tmp.Field1, tmp.Field2, ...
            FROM   #TempImport tmp;
    END;

    COMMIT TRAN;
END TRY
BEGIN CATCH
    IF (@@TRANCOUNT > 0)
    BEGIN
        ROLLBACK;
    END;

    -- THROW; -- if using SQL 2012 or newer, use this and remove the following 3 lines
    DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();
    RAISERROR(@ErrorMessage, 16, 1);
    RETURN;
END CATCH;
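For context, the template above assumes it is running inside a stored procedure that receives the XML as a parameter (which is also why the RETURN in the CATCH block works); a sketch of that wrapper, with a hypothetical procedure name:

CREATE PROCEDURE [SchemaName].[TableName_Import]
    @XmlInputParam XML
AS
BEGIN
    SET NOCOUNT ON;
    -- ... template body from above goes here ...
END;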
I have used this model several times on imports / ETLs that either have well over 1000 rows, or batches of maybe 500 rows out of a total set ranging from 20k to over a million rows. However, I have not tested the performance difference between the DELETE of the updated rows out of the temp table vs. just updating the [IsUpdate] field.
Please note, regarding the decision to use XML over a TVP because there are at most 1000 rows to import at a time (as mentioned in the question):
If this is being called a few times here and there, then quite possibly the minor performance gain of a TVP is not worth the additional maintenance cost (needing to drop the proc before changing the User-Defined Table Type, app code changes, etc.). But if you are importing 4 million rows, sending 1000 at a time, that is 4000 executions (and 4 million rows of XML to parse no matter how they are broken up), and even a performance difference that is negligible across a few executions will add up to a noticeable difference at that scale.
That being said, the method as I have described it does not change beyond replacing the SELECT FROM @XmlInputParam with SELECT FROM @TVP. Since TVPs are read-only, you wouldn't be able to delete from them; so instead of the simple WHERE [IsUpdate] = 0, I guess you could simply add a WHERE NOT EXISTS(SELECT * FROM @UpdatedIDs ids WHERE ids.IDField = tmp.IDField) to that final SELECT (tied to the INSERT). And if you use the @UpdatedIDs table variable in this manner, then you could even get away with not dumping the incoming rows into the temp table at all.
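To illustrate that variant, here is a sketch of the TVP version that skips the temp table entirely; the type, procedure, and column names are all made up, and only [SchemaName].[TableName] is carried over from the template:

CREATE TYPE dbo.ImportList AS TABLE
(
    IDField      INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);
GO

CREATE PROCEDURE dbo.ImportCustomers
    @TVP dbo.ImportList READONLY
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @UpdatedIDs TABLE ([IDField] INT NOT NULL);

    BEGIN TRAN;

    UPDATE tab
    SET    tab.CustomerName = tmp.CustomerName
    OUTPUT INSERTED.IDField
    INTO   @UpdatedIDs ([IDField])
    FROM   [SchemaName].[TableName] tab
    INNER JOIN @TVP tmp
            ON tmp.IDField = tab.IDField;

    -- TVPs are read-only, so filter out the updated rows here
    -- instead of deleting them from the source:
    INSERT INTO [SchemaName].[TableName] (IDField, CustomerName)
        SELECT tmp.IDField, tmp.CustomerName
        FROM   @TVP tmp
        WHERE  NOT EXISTS (SELECT *
                           FROM   @UpdatedIDs ids
                           WHERE  ids.IDField = tmp.IDField);

    COMMIT TRAN;
END;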
I would have thought that the Audit functionality would meet your requirements. If you search for SQL Server Audit you'll get lots of hits; it is very configurable, but too big to describe fully here. Have you looked at that?
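As a starting point, a minimal sketch of setting it up (the audit, specification, database names, and file path are all illustrative; note that database-level audit specifications require Enterprise Edition prior to SQL Server 2016 SP1):

USE [master];
CREATE SERVER AUDIT [MyAudit]
    TO FILE (FILEPATH = N'C:\AuditLogs\');
ALTER SERVER AUDIT [MyAudit] WITH (STATE = ON);
GO

USE [MyDatabase];
CREATE DATABASE AUDIT SPECIFICATION [MyAuditSpec]
    FOR SERVER AUDIT [MyAudit]
    ADD (SELECT, INSERT, UPDATE, DELETE
         ON DATABASE::[MyDatabase] BY [public])
    WITH (STATE = ON);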