PostgreSQL – Impact of Non-Durability Features on COPY Performance

copyperformancepostgresql

COPY is documented as not being WAL-logged if it's loaded on the same transactions..

In minimal level, WAL-logging of some bulk operations can be safely skipped, which can make those operations much faster (see Section 14.4.7). Operations in which this optimization can be applied include: CREATE TABLE AS,
CREATE INDEX,
CLUSTER,
COPY into tables that were created or truncated in the same transaction

With that said, do any of the non-durability features speed up a pg_restore -1?

-1 –single-transaction Execute the restore as a single transaction (that is, wrap the emitted commands in BEGIN/COMMIT). This ensures that either all the commands complete successfully, or no changes are applied. This option implies –exit-on-error.

COPY is a single transaction, will synchronous_commit do anything? Or does that only speed up batch-INSERTS?
Does making the initial table UNLOGGED matter if COPY ignores the WAL log anyway?
What about max_wal_size and checkpoint_timeout? I assume they would do nothing without WAL-logging?
To what extent does turning off fsync do anything other than playing with fire? Are we likely to see a speed up by turning it off?

question inspired by @Craig Ringer's answer on SO

Best Answer

The answer you refer to is talking about bulkloading a DB, not restoring a dump.

synchronous_commit will have minimal to no effect on COPY, since there's only one commit at the end.

COPY can only reduce WAL, not ignore it, and only if wal_level = minimal. Most production systems run with wal_level = hot_standby (in pg10, wal_level = replica) or higher. UNLOGGED tables absolutely do benefit COPY at higher wal_level, but only if you do not later plan on making the table logged. If you do, you're just deferring the WAL cost, since it'll be paid when you convert the UNLOGGED table to LOGGED instead of at COPY time. Use UNLOGGED tables if you can regenerate and reload the data later in case of crash and you don't need it to replicate. pg_restore will restore unlogged tables as unlogged, and logged tables as logged, so it's kind of a moot point.

Don't turn fsync off unless you really know what you're doing, or you don't care about your data. For bulk loading a new blank PostgreSQL instance, wal_level = minimal and synchronous_commit = off are probably going to be just as good. Better, even, since you avoid a whole bunch of WAL writing.

max_wal_size won't make much difference with bulk loading if wal_level = minimal. If you're using a higher WAL level you do need to tune checkpoints for best results. checkpoint_timeout doesn't really matter since it generally only comes into play on idle-ish systems.

So if you're using pg_restore to a system configured for physical replication (hot standby/warm standby/archive PITR) you've got to pay the costs for that, and there's not much you can do to speed things up on top of what COPY already does. pg_restore uses COPY.

Best Answer

Related Solutions

Postgresql – How does PostgreSQL handle Checkpoints in the middle of a WAL-enabled backup

PostgreSQL – Fixing Transaction OOM on 100k DDL Statements

Related Question