PostgreSQL – Optimizing Bulk Update Performance

Tags: bulk, performance, postgresql, update

Using PG 9.1 on Ubuntu 12.04.

It currently takes up to 24h for us to run a large set of UPDATE
statements on a database, which are of the form:

UPDATE table
SET field1 = constant1, field2 = constant2, ...
WHERE id = constid

(We're just overwriting fields of objects identified by ID.) The values come from an external data source (not already in the DB in a table).

The tables have a handful of indices each and no foreign key constraints.
No COMMIT is made till the end.

It takes 2h to import a pg_dump of the entire DB. This seems like a
baseline we should reasonably target.

Short of producing a custom program that somehow reconstructs a data set
for PostgreSQL to re-import, is there anything we can do to bring the
bulk UPDATE performance closer to that of the import? (This is an area
that we believe log-structured merge trees handle well, but we're
wondering if there's anything we can do within PostgreSQL.)

Some ideas:

  • dropping all non-ID indices and rebuilding afterward? (see the sketch after this list)
  • increasing checkpoint_segments, but does this actually help sustained
    long-term throughput?
  • using the techniques mentioned here? (Load new data as table, then
    "merge in" old data where ID is not found in new data)

Basically, there are a bunch of things to try and we're not sure which are
most effective or whether we're overlooking other things. We'll be
spending the next few days experimenting, but we thought we'd ask here
as well.

I do have concurrent load on the table but it's read-only.

Best Answer

Assumptions

Since information is missing in the Q, I'll assume:

  • Your data comes from a file on the database server.
  • The data is formatted just like COPY output, with a unique id per row to match the target table.
    If not, format it properly first or use COPY options to deal with the format (see the sketch after this list).
  • You are updating every single row in the target table or most of them.
  • You can afford to drop and recreate the target table.
    That means no concurrent access. Else consider this related answer:
  • There are no dependent objects at all, except for indices.
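
If the file is not in COPY's default text format, COPY's format options can deal with it; a minimal sketch, assuming a CSV file with a header row and the tmp_tbl from the script below:

-- assumption: the external file is CSV with a header row; adjust the options to your actual format
COPY tmp_tbl FROM '/absolute/path/to/file' WITH (FORMAT csv, HEADER true);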

Solution

I suggest you go with a similar approach to the one outlined at the link in your third bullet, with major optimizations.

To create the temporary table, there is a simpler and faster way:

CREATE TEMP TABLE tmp_tbl AS SELECT * FROM tbl LIMIT 0;

A single big UPDATE from a temporary table inside the database will be faster than individual updates from outside the database by several orders of magnitude.
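
For illustration, such a single big UPDATE could look like this; a minimal sketch, assuming the id and field names from the question:

-- one set-based UPDATE joining to the temp table (column names assumed from the question)
UPDATE tbl t
SET    field1 = u.field1
     , field2 = u.field2
FROM   tmp_tbl u
WHERE  t.id = u.id;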

In PostgreSQL's MVCC model, an UPDATE means to create a new row version and mark the old one as deleted. That's about as expensive as an INSERT and a DELETE combined. Plus, it leaves you with a lot of dead tuples. Since you are updating the whole table anyway, it would be faster overall to just create a new table and drop the old one.

If you have enough RAM available, set temp_buffers (only for this session!) high enough to hold the temp table in RAM - before you do anything else.

To get an estimate of how much RAM is needed, run a test with a small sample and use the database object size functions:

SELECT pg_size_pretty(pg_relation_size('tmp_tbl'));  -- complete size of table
SELECT pg_column_size(t) FROM tmp_tbl t LIMIT 10;  -- size of sample rows
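
To scale the sample up to the full data set, something along these lines works; the row count is an assumed figure and per-page overhead is ignored:

-- rough extrapolation: average sample row size times expected row count
-- (50 million rows is an assumed figure; per-page overhead is ignored)
SELECT pg_size_pretty((avg(pg_column_size(t)) * 50000000)::bigint)
FROM   (SELECT * FROM tmp_tbl LIMIT 1000) t;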

Complete script

SET temp_buffers = '1GB';        -- example value

CREATE TEMP TABLE tmp_tbl AS SELECT * FROM tbl LIMIT 0;

COPY tmp_tbl FROM '/absolute/path/to/file';

CREATE TABLE tbl_new AS
SELECT t.col1, t.col2, u.field1, u.field2
FROM   tbl     t
JOIN   tmp_tbl u USING (id);

-- Create indexes like in original table
ALTER TABLE tbl_new ADD PRIMARY KEY ...;
CREATE INDEX ... ON tbl_new (...);
CREATE INDEX ... ON tbl_new (...);

-- exclusive lock on tbl for a very brief time window!
DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;

DROP TABLE tmp_tbl; -- will also be dropped at end of session automatically

Concurrent load

Concurrent operations on the table (which I ruled out in the assumptions at the start) will wait once the table is locked near the end, and will fail as soon as the transaction is committed, because the table name is resolved to its OID immediately, but the new table has a different OID. The table stays consistent, but concurrent operations may get an exception and have to be repeated. Details in this related answer:
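
As for the brief lock window itself: the swap at the end can be wrapped in a single transaction so that the old table disappears and the new one takes its name atomically (DDL is transactional in PostgreSQL); a sketch using the names from the script above:

-- readers block on the exclusive lock and see the new table right after COMMIT
BEGIN;
DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;
COMMIT;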

UPDATE route

If you (have to) go the UPDATE route, drop any index that is not needed during the update and recreate it afterwards. It is much cheaper to create an index in one piece than to update it for every individual row. This may also allow for HOT updates.
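
A sketch of that sequence, with hypothetical index names:

-- hypothetical index names; drop secondary indexes the UPDATE does not need
DROP INDEX tbl_field1_idx;
DROP INDEX tbl_field2_idx;

-- ... run the single big UPDATE from the temp table here (as sketched above) ...

-- recreate each index in one pass
CREATE INDEX tbl_field1_idx ON tbl (field1);
CREATE INDEX tbl_field2_idx ON tbl (field2);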

I outlined a similar procedure using UPDATE in this closely related answer on SO.