The larger the CACHE on the sequence, the fewer times you'll need to wait for Oracle to acquire the latch and update the data dictionary when a new set of values needs to be generated, so there is at least the potential for some performance benefit from increasing the size of the cache. On the other hand, if the CACHE is already at 10,000, you're only going to need to update the data dictionary 20,000 times in the course of a 200 million row data load (assuming one call to sequence.nextval per row). Without seeing your system, it seems unlikely that those 20,000 updates account for a meaningful fraction of the time spent on a 200 million row load. You might shave a couple of seconds off the load by increasing the size of the cache, but that probably isn't meaningful given the amount of time you're likely spending on I/O.
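For reference, bumping the cache is a one-line change; a minimal sketch, assuming a hypothetical sequence named load_seq:

-- Pre-allocate 100,000 values per latch acquisition / dictionary update,
-- i.e. roughly 2,000 dictionary updates over a 200 million row load.
ALTER SEQUENCE load_seq CACHE 100000;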
Assumptions
Since information is missing in the Q, I'll assume:
- Your data comes from a file on the database server.
- The data is formatted just like COPY output, with a unique id per row to match the target table. If not, format it properly first or use COPY options to deal with the format.
- You are updating every single row in the target table or most of them.
- You can afford to drop and recreate the target table.
That means no concurrent access. Else consider this related answer:
- There are no dependent objects at all, except for indices.
Solution
I suggest you go with an approach similar to the one outlined at the link in your third bullet, but with major optimizations.
To create the temporary table, there is a simpler and faster way:
CREATE TEMP TABLE tmp_tbl AS SELECT * FROM tbl LIMIT 0;
A single big UPDATE from a temporary table inside the database will be faster than individual updates from outside the database by several orders of magnitude.

In PostgreSQL's MVCC model, an UPDATE means to create a new row version and mark the old one as deleted. That's about as expensive as an INSERT and a DELETE combined. Plus, it leaves you with a lot of dead tuples. Since you are updating the whole table anyway, it would be faster overall to just create a new table and drop the old one.
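To illustrate, one way to see the dead tuples a full-table UPDATE leaves behind is the statistics view pg_stat_user_tables (a quick sketch, using the tbl name from this answer):

SELECT n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'tbl';   -- after a full-table UPDATE, n_dead_tup is roughly n_live_tup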
If you have enough RAM available, set temp_buffers (only for this session!) high enough to hold the temp table in RAM, before you do anything else.

To get an estimate of how much RAM is needed, run a test with a small sample and use db object size functions:
SELECT pg_size_pretty(pg_relation_size('tmp_tbl')); -- complete size of table
SELECT pg_column_size(t) FROM tmp_tbl t LIMIT 10; -- size of sample rows
Complete script
SET temp_buffers = '1GB'; -- example value
CREATE TEMP TABLE tmp_tbl AS SELECT * FROM tbl LIMIT 0;
COPY tmp_tbl FROM '/absolute/path/to/file';
CREATE TABLE tbl_new AS
SELECT t.col1, t.col2, u.field1, u.field2
FROM tbl t
JOIN tmp_tbl u USING (id);
-- Create indexes like in original table
ALTER TABLE tbl_new ADD PRIMARY KEY ...;
CREATE INDEX ... ON tbl_new (...);
CREATE INDEX ... ON tbl_new (...);
-- exclusive lock on tbl for a very brief time window!
DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;
DROP TABLE tmp_tbl; -- will also be dropped at end of session automatically
Concurrent load
Concurrent operations on the table (which I ruled out in the assumptions at the start) will wait once the table is locked near the end, and will fail as soon as the transaction is committed, because the table name is resolved to its OID immediately, but the new table has a different OID. The table stays consistent, but concurrent operations may get an exception and have to be repeated. Details in this related answer:
UPDATE route
If you (have to) go the UPDATE route, drop any index that is not needed during the update and recreate it afterwards. It is much cheaper to create an index in one piece than to update it for every individual row. This may also allow for HOT updates.

I outlined a similar procedure using UPDATE in this closely related answer on SO.
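A minimal sketch of that route, assuming the same tbl / tmp_tbl names as above and a hypothetical index tbl_field1_idx that isn't needed during the update:

DROP INDEX tbl_field1_idx;                     -- drop indexes not needed during the update

UPDATE tbl t
SET    field1 = u.field1
     , field2 = u.field2
FROM   tmp_tbl u
WHERE  t.id = u.id;

CREATE INDEX tbl_field1_idx ON tbl (field1);   -- recreate afterwards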
Best Answer
There certainly could be, as there is a slight performance difference (measured under UPDATE 1 below). However, how much of a difference there is would need to be measured by you, on your system, with your schema, data, and system load. There are several factors that play into how much impact a non-updating UPDATE has:
If you issue a non-updating statement such as UPDATE TableName SET Field1 = Field1, then an Update Trigger will still fire and indicate that the field was updated (if you check using either the UPDATE() or COLUMNS_UPDATED() functions), even though the field in both the INSERTED and DELETED tables has the same value.

Also, there is a summary section covering this in Paul White's article, The Impact of Non-Updating Updates (as noted by @spaghettidba in a comment on his answer).
Please keep in mind the following two items (especially if you don't follow the link to see Paul's full article):
Non-updating updates still have some log activity, showing that a transaction is beginning and ending. It is just that no data modification happens (which is still a good savings).
As I stated above, you need to test on your system. Use the same research queries that Paul is using and see if you get the same results. I am seeing slightly different results on my system than what is shown in the article. Still no dirty pages to be written, but a little more log activity.
Simplistically, if you are just dealing with a single row, you can do the following:
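A minimal sketch of that single-row check, assuming a hypothetical dbo.MyTable keyed by ID with a NOT NULL Field1, and @ID / @NewValue as parameters of a hypothetical procedure:

IF EXISTS (SELECT 1 FROM dbo.MyTable WHERE ID = @ID AND Field1 <> @NewValue)
BEGIN
    -- The row exists and the value actually differs, so the UPDATE does real work.
    UPDATE dbo.MyTable
    SET    Field1 = @NewValue
    WHERE  ID = @ID;
END;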
For multiple rows, you can get the information needed to make that decision by using the OUTPUT clause. By capturing exactly which rows were updated, you can narrow down the items to look up, so you can tell the difference between rows that weren't updated because they don't exist and rows that exist but didn't need the update (see the sketch below).

I show the basic implementation in the following answer:
How to avoid using Merge query when upserting multiple data using xml parameter?
The method shown in that answer doesn't filter out rows that exist yet do not need to be updated. That portion could be added, but you would first need to show exactly where you are getting your dataset that you are merging into MyTable. Are they coming from a temporary table? A table-valued parameter (TVP)?
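To illustrate the OUTPUT approach mentioned above, here is a minimal sketch; the @SourceRows table variable (standing in for a temp table or TVP), dbo.MyTable, and the column names are placeholders, not from the linked answer:

DECLARE @SourceRows TABLE (ID INT NOT NULL, Field1 INT NOT NULL);  -- stand-in for temp table / TVP
DECLARE @Updated    TABLE (ID INT NOT NULL);

UPDATE mt
SET    mt.Field1 = src.Field1
OUTPUT INSERTED.ID INTO @Updated (ID)
FROM   dbo.MyTable mt
INNER JOIN @SourceRows src
        ON src.ID = mt.ID
WHERE  mt.Field1 <> src.Field1;   -- skip rows that already have the incoming value

-- Source rows not captured in @Updated either don't exist in MyTable
-- or already had the incoming value.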
UPDATE 1:
I was finally able to do some testing, and here is what I found regarding the transaction log and locking. First, the schema for the table:
Next, the test updating the field to the value that it already has:
Results:
Finally, the test that filters out the update due to the value not changing:
Results:
As you can see, nothing is written to the Transaction Log when filtering out the row, as opposed to the two entries marking the beginning and ending of the Transaction. And while it is true that those two entries are almost nothing, they are still something.
Also, the locking of the PAGE and KEY resources is less restrictive when filtering out the rows that haven't changed. If no other processes are interacting with this table then it is probably a non-issue (but how likely is that, really?). Keep in mind that the testing shown in any of the linked blogs (and even my testing) implicitly assumes that there is no contention on the table, since it is never part of the tests. Saying that non-updating updates are so light-weight that it doesn't pay to do the filtering needs to be taken with a grain of salt, since the testing has been done, more or less, in a vacuum. But in Production, this table is most likely not isolated. Of course, it could very well be that the little bit of logging and the more restrictive locks don't translate into less efficiency. So the most reliable source of information to answer this question? SQL Server. Specifically: your SQL Server. It will show you which method is better for your system :-).
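If you want to check this on your own system, a minimal sketch for inspecting the locks a given statement takes (dbo.MyTable and the literal ID value are placeholders):

BEGIN TRANSACTION;

UPDATE dbo.MyTable
SET    Field1 = Field1    -- non-updating UPDATE; swap in the filtered version to compare
WHERE  ID = 1;

SELECT resource_type, request_mode, request_status
FROM   sys.dm_tran_locks
WHERE  request_session_id = @@SPID;

ROLLBACK TRANSACTION;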
UPDATE 2:
If the operations in which the new value is the same as the current value (i.e. no update needed) outnumber the operations in which the new value is different and the update is necessary, then the following pattern might prove to be even better, especially if there is a lot of contention on the table. The idea is to do a simple SELECT first to get the current value. If you don't get a value, then you have your answer regarding the INSERT. If you do have a value, you can do a simple IF and issue the UPDATE only if it is needed (a sketch of the pattern follows at the end of this answer).

Results:
So there are only 2 locks acquired instead of 3, and both of these locks are Intent Shared, not Intent eXclusive or Intent Update (Lock Compatibility). Keeping in mind that each lock acquired will also get released, each lock is really 2 operations, so this new method is a total of 4 operations instead of the 6 operations in the originally proposed method. Considering this operation is running once every 15 ms (approximately, as stated by the O.P.), that is about 66 times per second. So the original proposal amounts to 396 lock/unlock operations per second, while this new method amounts to only 264 lock/unlock operations per second of even lighter-weight locks. This is not a guarantee of awesome performance, but certainly worth testing :-).
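A minimal sketch of that SELECT-first pattern, using the same placeholder dbo.MyTable, @ID, and @NewValue names as above (and assuming Field1 is NOT NULL, so a NULL read means the row doesn't exist):

DECLARE @CurrentValue INT;

SELECT @CurrentValue = Field1
FROM   dbo.MyTable
WHERE  ID = @ID;

IF (@CurrentValue IS NULL)
BEGIN
    -- No row found: this is the INSERT case.
    INSERT INTO dbo.MyTable (ID, Field1)
    VALUES (@ID, @NewValue);
END
ELSE IF (@CurrentValue <> @NewValue)
BEGIN
    -- The row exists and the value differs: only now issue the UPDATE.
    UPDATE dbo.MyTable
    SET    Field1 = @NewValue
    WHERE  ID = @ID;
END;
-- Otherwise the value already matches, so skip the write entirely.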