PostgreSQL – Optimizing bulk update performance in PostgreSQL with dependencies

Tags: dependencies, performance, postgresql, update

Basically my question is the same as this one, but WITH dependencies, so dropping/renaming the table is not a trivial option (I assume).

We are refactoring a large, poorly designed table which has many columns and references to it. It currently has a text field that should be a foreign key. The naive update looks like:

ALTER TABLE myTable ADD COLUMN new_id int REFERENCES list(id);
-- correlated subquery runs once per row; every row is rewritten
UPDATE myTable SET new_id = (SELECT id FROM list WHERE name = old_text);
ALTER TABLE myTable DROP COLUMN old_text;

The above takes practically forever because the table is large, and it temporarily doubles in size on disk: under PostgreSQL's MVCC, an UPDATE writes a new row version rather than changing the row in place, much like an INSERT followed by a DELETE.

We do not need everything done in one transaction, so we are considering an external script that performs the updates in batches of roughly 5000 rows, but tests indicate it will still be painfully slow.
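
A batched variant might look something like the sketch below. It assumes myTable has an integer primary key called id (hypothetical here) and joins against list instead of running a subquery per row:

-- Sketch only: an integer primary key "id" on myTable is assumed.
UPDATE myTable m
SET    new_id = l.id
FROM   list l
WHERE  l.name = m.old_text
AND    m.id >= 1 AND m.id < 5001;  -- advance this key range per batch
-- Commit, then repeat with the next range (5001..10000, and so on),
-- optionally vacuuming between batches so dead row versions can be reused.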

Suggestions on how to improve performance?

Best Answer

Given that you cannot afford to drop and recreate the table, this related answer would be a better fit:

You might drop indexes that are not needed during the update and recreate them when you are done (assuming they are not completely expendable).
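
For example (index names are hypothetical, just to illustrate the pattern):

-- Hypothetical index names, for illustration only:
DROP INDEX IF EXISTS mytable_old_text_idx;
-- ... run the bulk / batched UPDATEs here ...
CREATE INDEX mytable_new_id_idx ON myTable (new_id);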

And all the general advice for performance optimization applies.

There is not much more you can do if you have to update the table in place, bit by bit. The faster alternatives all drop the table and rewrite / recreate it from scratch.
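For reference, a minimal sketch of that rewrite-from-scratch route is below; with dependencies it only works if the foreign keys and views that point at myTable are dropped and recreated around it, which is exactly what the question wants to avoid:

-- Sketch only: handling of dependent FKs/views is assumed to happen separately.
CREATE TABLE myTable_new AS
SELECT m.*, l.id AS new_id          -- old_text is still included; drop it afterwards
FROM   myTable m
LEFT   JOIN list l ON l.name = m.old_text;
-- then: add constraints and indexes, re-point dependencies,
-- DROP TABLE myTable, and ALTER TABLE myTable_new RENAME TO myTable.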
