The larger the CACHE on the sequence, the fewer times you'll need to wait for Oracle to acquire the latch and update the data dictionary when a new set of values needs to be generated, so there is at least the potential for some performance benefit from increasing the size of the cache. On the other hand, if the CACHE is already at 10,000, you're only going to need to update the data dictionary 20,000 times in the course of a 200 million row data load (assuming one call to sequence.nextval per row). Without seeing your system, it seems unlikely that those 20,000 updates account for a meaningful fraction of the time spent on a 200 million row load. You might shave a couple of seconds off the load by increasing the size of the cache, but that probably isn't meaningful given the amount of time you're likely spending on I/O.
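For reference, bumping the cache is a one-line change; a minimal sketch, assuming a hypothetical sequence named load_seq:

-- Pre-allocate 100,000 values per latch acquisition / dictionary update,
-- i.e. roughly 2,000 dictionary updates over a 200 million row load.
ALTER SEQUENCE load_seq CACHE 100000;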
Assumptions
Since information is missing in the Q, I'll assume:
- Your data comes from a file on the database server.
- The data is formatted just like COPY output, with a unique id per row to match the target table. If not, format it properly first or use COPY options to deal with the format.
- You are updating every single row in the target table or most of them.
- You can afford to drop and recreate the target table.
That means no concurrent access. Else consider this related answer:
- There are no dependent objects at all, except for indices.
Solution
I suggest you go with an approach similar to the one outlined at the link in your third bullet, but with major optimizations.
To create the temporary table, there is a simpler and faster way:
CREATE TEMP TABLE tmp_tbl AS SELECT * FROM tbl LIMIT 0;
A single big UPDATE from a temporary table inside the database will be faster than individual updates from outside the database by several orders of magnitude.

In PostgreSQL's MVCC model, an UPDATE means to create a new row version and mark the old one as deleted. That's about as expensive as an INSERT and a DELETE combined. Plus, it leaves you with a lot of dead tuples. Since you are updating the whole table anyway, it would be faster overall to just create a new table and drop the old one.
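To illustrate, one way to see the dead tuples a full-table UPDATE leaves behind is the statistics view pg_stat_user_tables (a quick sketch, using the tbl name from this answer):

SELECT n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'tbl';   -- after a full-table UPDATE, n_dead_tup is roughly n_live_tup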
If you have enough RAM available, set temp_buffers (only for this session!) high enough to hold the temp table in RAM, before you do anything else.

To get an estimate of how much RAM is needed, run a test with a small sample and use db object size functions:
SELECT pg_size_pretty(pg_relation_size('tmp_tbl')); -- complete size of table
SELECT pg_column_size(t) FROM tmp_tbl t LIMIT 10; -- size of sample rows
Complete script
SET temp_buffers = '1GB'; -- example value
CREATE TEMP TABLE tmp_tbl AS SELECT * FROM tbl LIMIT 0;
COPY tmp_tbl FROM '/absolute/path/to/file';
CREATE TABLE tbl_new AS
SELECT t.col1, t.col2, u.field1, u.field2
FROM tbl t
JOIN tmp_tbl u USING (id);
-- Create indexes like in original table
ALTER TABLE tbl_new ADD PRIMARY KEY ...;
CREATE INDEX ... ON tbl_new (...);
CREATE INDEX ... ON tbl_new (...);
-- exclusive lock on tbl for a very brief time window!
DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;
DROP TABLE tmp_tbl; -- will also be dropped at end of session automatically
Concurrent load
Concurrent operations on the table (which I ruled out in the assumptions at the start) will wait once the table is locked near the end, and will fail as soon as the transaction is committed, because the table name is resolved to its OID immediately, but the new table has a different OID. The table stays consistent, but concurrent operations may get an exception and have to be repeated. Details in this related answer:
UPDATE route
If you (have to) go the UPDATE route, drop any index that is not needed during the update and recreate it afterwards. It is much cheaper to create an index in one piece than to update it for every individual row. This may also allow for HOT updates.

I outlined a similar procedure using UPDATE in this closely related answer on SO.
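A minimal sketch of that route, assuming the same tbl / tmp_tbl names as above and a hypothetical index tbl_field1_idx that isn't needed during the update:

DROP INDEX tbl_field1_idx;                     -- drop indexes not needed during the update

UPDATE tbl t
SET    field1 = u.field1
     , field2 = u.field2
FROM   tmp_tbl u
WHERE  t.id = u.id;

CREATE INDEX tbl_field1_idx ON tbl (field1);   -- recreate afterwards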
Best Answer
There certainly could be, as there is a slight performance difference (measured under UPDATE 1 below). However, how much of a difference there is would need to be measured by you, on your system, with your schema, data, and system load. There are several factors that play into how much impact a non-updating UPDATE has:
If you issue a non-updating statement such as UPDATE TableName SET Field1 = Field1, then an Update Trigger will still fire and indicate that the field was updated (if you check using either the UPDATE() or COLUMNS_UPDATED() functions), even though the field in both the INSERTED and DELETED tables has the same value.

Also, there is a summary section covering this in Paul White's article, The Impact of Non-Updating Updates (as noted by @spaghettidba in a comment on his answer).
Please keep in mind the following two items (especially if you don't follow the link to see Paul's full article):
Non-updating updates still have some log activity, showing that a transaction is beginning and ending. It is just that no data modification happens (which is still a good savings).
As I stated above, you need to test on your system. Use the same research queries that Paul is using and see if you get the same results. I am seeing slightly different results on my system than what is shown in the article. Still no dirty pages to be written, but a little more log activity.
Simplistically, if you are just dealing with a single row, you can do the following:
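A minimal sketch of that single-row check, assuming a hypothetical dbo.MyTable keyed by ID with a NOT NULL Field1, and @ID / @NewValue as parameters of a hypothetical procedure:

IF EXISTS (SELECT 1 FROM dbo.MyTable WHERE ID = @ID AND Field1 <> @NewValue)
BEGIN
    -- The row exists and the value actually differs, so the UPDATE does real work.
    UPDATE dbo.MyTable
    SET    Field1 = @NewValue
    WHERE  ID = @ID;
END;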
For multiple rows, you can get the information needed to make that decision by using the OUTPUT clause. By capturing exactly which rows were updated, you can narrow down the items to look up, so you can tell the difference between rows that weren't updated because they don't exist and rows that exist but didn't need the update (see the sketch below).

I show the basic implementation in the following answer:
How to avoid using Merge query when upserting multiple data using xml parameter?
The method shown in that answer doesn't filter out rows that exist yet do not need to be updated. That portion could be added, but you would first need to show exactly where you are getting your dataset that you are merging into MyTable. Are they coming from a temporary table? A table-valued parameter (TVP)?
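To illustrate the OUTPUT approach mentioned above, here is a minimal sketch; the @SourceRows table variable (standing in for a temp table or TVP), dbo.MyTable, and the column names are placeholders, not from the linked answer:

DECLARE @SourceRows TABLE (ID INT NOT NULL, Field1 INT NOT NULL);  -- stand-in for temp table / TVP
DECLARE @Updated    TABLE (ID INT NOT NULL);

UPDATE mt
SET    mt.Field1 = src.Field1
OUTPUT INSERTED.ID INTO @Updated (ID)
FROM   dbo.MyTable mt
INNER JOIN @SourceRows src
        ON src.ID = mt.ID
WHERE  mt.Field1 <> src.Field1;   -- skip rows that already have the incoming value

-- Source rows not captured in @Updated either don't exist in MyTable
-- or already had the incoming value.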
UPDATE 1:
I was finally able to do some testing, and here is what I found regarding the transaction log and locking. First, the schema for the table:
Next, the test updating the field to the value that it already has:
Results:
Finally, the test that filters out the update due to the value not changing:
Results:
As you can see, nothing is written to the Transaction Log when filtering out the row, as opposed to the two entries marking the beginning and ending of the Transaction. And while it is true that those two entries are almost nothing, they are still something.
Also, the locking of the PAGE and KEY resources is less restrictive when filtering out the rows that haven't changed. If no other processes are interacting with this table then it is probably a non-issue (but how likely is that, really?). Keep in mind that the testing shown in any of the linked blogs (and even my testing) implicitly assumes that there is no contention on the table, since it is never part of the tests. Saying that non-updating updates are so light-weight that it doesn't pay to do the filtering needs to be taken with a grain of salt, since the testing has been done, more or less, in a vacuum. But in Production, this table is most likely not isolated. Of course, it could very well be that the little bit of logging and the more restrictive locks don't translate into less efficiency. So the most reliable source of information to answer this question? SQL Server. Specifically: your SQL Server. It will show you which method is better for your system :-).
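If you want to check this on your own system, a minimal sketch for inspecting the locks a given statement takes (dbo.MyTable and the literal ID value are placeholders):

BEGIN TRANSACTION;

UPDATE dbo.MyTable
SET    Field1 = Field1    -- non-updating UPDATE; swap in the filtered version to compare
WHERE  ID = 1;

SELECT resource_type, request_mode, request_status
FROM   sys.dm_tran_locks
WHERE  request_session_id = @@SPID;

ROLLBACK TRANSACTION;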
UPDATE 2:
If the operations in which the new value is the same as the current value (i.e. no update needed) outnumber the operations in which the new value is different and the update is necessary, then the following pattern might prove to be even better, especially if there is a lot of contention on the table. The idea is to do a simple SELECT first to get the current value. If you don't get a value, then you have your answer regarding the INSERT. If you do have a value, you can do a simple IF and issue the UPDATE only if it is needed (a sketch of the pattern follows at the end of this answer).

Results:
So there are only 2 locks acquired instead of 3, and both of these locks are Intent Shared, not Intent eXclusive or Intent Update (Lock Compatibility). Keeping in mind that each lock acquired will also get released, each lock is really 2 operations, so this new method is a total of 4 operations instead of the 6 operations in the originally proposed method. Considering this operation is running once every 15 ms (approximately, as stated by the O.P.), that is about 66 times per second. So the original proposal amounts to 396 lock/unlock operations per second, while this new method amounts to only 264 lock/unlock operations per second of even lighter-weight locks. This is not a guarantee of awesome performance, but certainly worth testing :-).
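A minimal sketch of that SELECT-first pattern, using the same placeholder dbo.MyTable, @ID, and @NewValue names as above (and assuming Field1 is NOT NULL, so a NULL read means the row doesn't exist):

DECLARE @CurrentValue INT;

SELECT @CurrentValue = Field1
FROM   dbo.MyTable
WHERE  ID = @ID;

IF (@CurrentValue IS NULL)
BEGIN
    -- No row found: this is the INSERT case.
    INSERT INTO dbo.MyTable (ID, Field1)
    VALUES (@ID, @NewValue);
END
ELSE IF (@CurrentValue <> @NewValue)
BEGIN
    -- The row exists and the value differs: only now issue the UPDATE.
    UPDATE dbo.MyTable
    SET    Field1 = @NewValue
    WHERE  ID = @ID;
END;
-- Otherwise the value already matches, so skip the write entirely.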