PostgreSQL – Most efficient way to do many UPDATEs in PostgreSQL

Tags: postgresql, update

I have to update about 2.5 million rows in a table with about 20 million rows.

The new value of the updated column is based on the value of another column, but an additional SELECT is required to determine it. The column being updated is also part of an index.

Currently I have generated an SQL script with one UPDATE statement per row, but I imagine this is not very efficient. What is the most efficient way to do this, and are there any post-update actions to perform to keep the table performing well? The table is in constant use (SELECTed from, UPDATEd, INSERTed into), and I don't want this to cause any downtime. The PostgreSQL version is 9.1.

EDIT

The updates in the script are of the form

UPDATE table1 SET fid=123 WHERE id=345;

where the id column is an integer and the primary key of the table.

The schema looks like this:

CREATE TABLE table1 (
    id serial PRIMARY KEY,
    date timestamp,
    rname varchar(50),
    fid integer references table2(id)
);

CREATE INDEX ON table1 (fid, date);

CREATE TABLE table2 (
    id serial PRIMARY KEY,
    rname varchar(50)
);

The fid column was recently added, and I need to update some records in table1 so that they are properly linked by record id rather than by the rname column, which causes data redundancy and other problems.

I've written a Perl script that generates the set of UPDATE statements based on the current state of the tables. Newly inserted records in table1 already have fid filled in properly; I need to update the old ones.
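
For illustration, the mapping the script computes boils down to a join on rname; a minimal set-based sketch, assuming rname uniquely identifies a row in table2 (a single statement like this updates all rows in one transaction, which may not be desirable on a live table):

-- Set-based equivalent of the generated per-row statements
-- (assumption: rname uniquely identifies a row in table2).
UPDATE table1
SET fid = table2.id
FROM table2
WHERE table1.rname = table2.rname
  AND table1.fid IS NULL;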

Best Answer

If you don't want to interfere with other activity, then one UPDATE at a time in autocommit mode is very likely the best option. You should probably set synchronous_commit=off in that session (and only that session).
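
A minimal sketch of that, assuming the generated script is fed to psql:

-- Applies to this session only. A server crash can lose the most recent
-- commits, but cannot corrupt data; interrupted work is simply re-run.
SET synchronous_commit = off;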

The indexes are going to slow you down, perhaps by a lot depending on your RAM and your IO system. But if the index is necessary for the other actions you don't want to interfere with, then there isn't anything you can do about it.

But since fid is not yet correctly populated, the index on it is probably not actually useful to the concurrent processes you want to avoid interfering with, as they haven't yet been changed to rely on that column being accurate. If that is the case, you can drop that index to gain speed and rebuild it in bulk later. The same probably applies to the foreign key constraint.
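
A sketch of that, assuming the auto-generated names table1_fid_date_idx and table1_fid_fkey (check \d table1 for the actual ones):

DROP INDEX table1_fid_date_idx;                      -- assumed name
ALTER TABLE table1 DROP CONSTRAINT table1_fid_fkey;  -- assumed name

-- ... run the generated UPDATE script here ...

-- CONCURRENTLY avoids blocking concurrent writes during the rebuild.
CREATE INDEX CONCURRENTLY table1_fid_date_idx ON table1 (fid, date);

-- On 9.1 the constraint can be added as NOT VALID and validated separately,
-- keeping the exclusive lock on table1 short.
ALTER TABLE table1 ADD CONSTRAINT table1_fid_fkey
    FOREIGN KEY (fid) REFERENCES table2 (id) NOT VALID;
ALTER TABLE table1 VALIDATE CONSTRAINT table1_fid_fkey;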

Once that index is gone, your updates can proceed via HOT (Heap Only Tuples) updates provided each block has enough free space. In that case, the updates will not have to do maintenance on the primary key index, either, saving that much more IO. To maximize the likelihood that this will work optimally, it is important that each UPDATE be its own transaction. That way one UPDATE can reuse space freed up by an earlier one.
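
To verify that the updates are in fact going HOT, you can watch the counters in pg_stat_user_tables before and after a batch:

SELECT n_tup_upd,     -- total rows updated
       n_tup_hot_upd  -- updates that went HOT (no index maintenance)
FROM pg_stat_user_tables
WHERE relname = 'table1';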

Also, your WHERE clause should probably look like this:

WHERE id=345 AND fid IS NULL;

That way, if the script gets interrupted, you can re-run it with minimal damage: rows whose fid has already been set are simply skipped.

Since you seem to be running this on a test system already, an EXPLAIN (ANALYZE, BUFFERS) of some of the updates would be helpful, especially with track_io_timing turned on.
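
A sketch of how to capture that on the test system (EXPLAIN ANALYZE really executes the UPDATE, hence the ROLLBACK; track_io_timing only exists from PostgreSQL 9.2 on, so omit that line on 9.1):

BEGIN;
SET LOCAL track_io_timing = on;  -- 9.2+ only; omit on 9.1
EXPLAIN (ANALYZE, BUFFERS)
UPDATE table1 SET fid = 123 WHERE id = 345 AND fid IS NULL;
ROLLBACK;  -- undo the test update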