SQL Server – Efficiently bulk upsert unrelated rows

etl, sql-server, table-valued-parameters, transactional-replication, upsert

As part of a manual replication process between databases with different but related schemas, for each table we identify new and modified records in the source database and feed them via a table-valued parameter to a stored procedure in the SQL Server destination database. The source database is not SQL Server and is in a different data center.

For each row to be upserted, the stored procedure queries for the primary keys of the related rows already in the destination database.
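
For reference, the plumbing looks roughly like this (names simplified to a hypothetical dbo.RowBatch type, dbo.UpsertItems procedure, and dbo.Items destination table):

create type dbo.RowBatch as table
(
    SourceKey int not null primary key,
    Payload   nvarchar(100) not null
);
go

create procedure dbo.UpsertItems
    @rows dbo.RowBatch readonly
as
begin
    set nocount on;

    -- Look up destination primary keys for the incoming rows.
    select d.Id, r.SourceKey
    from @rows r
    join dbo.Items d on d.SourceKey = r.SourceKey;

    -- ...MERGE (or UPDATE/INSERT) against dbo.Items goes here...
end
go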

Currently, we do a single SELECT to get the primary keys, followed by a single MERGE to perform the upsert (sketched after the list below). There are two aspects of this approach that I believe may not be as efficient as possible.

  1. An implicit transaction unnecessarily wraps the MERGE. The database would remain consistent even with each row being upserted one at a time. If a row's upsert fails, we want the remaining rows to proceed.

  2. MERGE interleaves inserts and updates as it goes through the rows, which is fine, but we don't need this. It would be acceptable to update all the modified rows and subsequently insert all the new rows.
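
For reference, the current single-MERGE approach looks roughly like this (using the same simplified names as the sketch above):

merge dbo.Items as d
using @rows as s
    on d.SourceKey = s.SourceKey
when matched then
    update set d.Payload = s.Payload
when not matched then
    insert (SourceKey, Payload)
    values (s.SourceKey, s.Payload);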

Based on the flexibility we have, the MERGE performance tip to use UPDATE and INSERT seems to apply to our case:

When simply updating one table based on the rows of another table, improved performance and scalability can be achieved with basic INSERT, UPDATE, and DELETE statements.

Do I understand that right? Am I better off with a separate UPDATE and INSERT? And what about the implicit transaction? Is performing a single SELECT, UPDATE, and INSERT over a large batch of rows most efficient, or is it better to take advantage of the ability to do one row at a time by using a FOR loop (see the sketch below)? Or something else?
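
For concreteness, the row-at-a-time alternative I have in mind would look something like this (T-SQL has no FOR loop as such, so a cursor with per-row TRY/CATCH is the shape I'm imagining; names are the same hypothetical ones as above):

declare @key int, @payload nvarchar(100);

declare row_cursor cursor local fast_forward for
    select SourceKey, Payload from @rows;

open row_cursor;
fetch next from row_cursor into @key, @payload;

while @@fetch_status = 0
begin
    begin try
        -- Upsert one row; a failure here doesn't stop the remaining rows.
        update dbo.Items set Payload = @payload where SourceKey = @key;
        if @@rowcount = 0
            insert dbo.Items (SourceKey, Payload) values (@key, @payload);
    end try
    begin catch
        print error_message();  -- log and continue with the next row
    end catch

    fetch next from row_cursor into @key, @payload;
end

close row_cursor;
deallocate row_cursor;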

In general, what is the most efficient way to upsert a large batch of rows to a SQL Server table in which the rows are not transactionally related and updates and inserts need not be interleaved?

Best Answer

To elaborate on my comment, I would use a set-based approach for better performance, with error handling. Here is how I would do it, mixing the two common methods: JOIN and NOT EXISTS/EXISTS.

Note that I don't use a temp table or table variable here. I've seen performance gains from pulling the data into a temp table before doing an upsert across a linked server (or similar) when the databases are on different servers, but I'm not sure whether your databases are on the same instance or not.

set xact_abort on;
begin try
    begin tran;

    -- Update rows that already exist in the destination
    -- (col1/col2 are placeholders for the real column list).
    update d
    set d.col1 = s.col1,
        d.col2 = s.col2
    from destinationTable d
    inner join sourceTable s
        on s.id = d.id;

    -- Insert rows that don't exist in the destination yet.
    insert into destinationTable (id, col1, col2)
    select s.id, s.col1, s.col2
    from sourceTable s
    where not exists (select 1 from destinationTable d where d.id = s.id);

    commit;
end try
begin catch
    if @@trancount > 0 rollback tran;
    declare @msg nvarchar(2048) = error_message();
    raiserror (@msg, 16, 1);
    return 55555;
end catch


If the data is on a different server, a temp table could improve the performance of the insert (and may be the only option if the source isn't SQL Server). The above code would be the same, except that sourceTable would be replaced with #sourceTable.

If the table is large, you may want to consider a permanent staging table instead of a #temp table so that you don't beat up tempdb too badly. BULK INSERT into the staging table (sourceTable) and then do your upsert from there, as sketched below.
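
A rough sketch of that staging variant, with a placeholder file path and the same placeholder columns as the code above:

create table dbo.sourceTable
(
    id   int not null primary key,
    col1 nvarchar(100) null,
    col2 nvarchar(100) null
);

truncate table dbo.sourceTable;  -- reuse the staging table between batches

bulk insert dbo.sourceTable
from 'C:\etl\batch.csv'          -- placeholder path to your exported data
with (fieldterminator = ',', rowterminator = '\n', firstrow = 2);

-- Then run the UPDATE/INSERT block above against dbo.sourceTable.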