PostgreSQL – Optimizing PostgreSQL for transient data

database-tuning, optimization, postgresql

I have several tables, each with 100-300 integer columns, that hold highly volatile data. The datasets are keyed by one or two primary keys, and when a refresh occurs, the whole dataset is deleted and the new data is inserted in one transaction. Dataset size is usually a few hundred rows, but can be up to several thousand rows in extreme cases. Refreshes occur once per second, and dataset updates for different keys are usually disjoint, so dropping and recreating the table is not feasible.

How do I tune Postgres to handle such load? I can use the latest and greatest version if that makes any difference.

Best Answer

Depending on how many different data sets there are, one option would be to partition the tables per-dataset.
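
For example, a minimal sketch (dataset_template and dataset_1234 are hypothetical names standing in for however you identify your datasets):

    -- One template holding the common column definitions, plus one physical
    -- table per dataset so each dataset can be truncated independently.
    CREATE TABLE dataset_template (
        key_a integer NOT NULL,
        key_b integer NOT NULL,
        col_1 integer,
        -- ... the remaining integer columns ...
        PRIMARY KEY (key_a, key_b)
    );

    CREATE TABLE dataset_1234 (LIKE dataset_template INCLUDING ALL);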

When a dataset is updated, BEGIN a new transaction, TRUNCATE the table, COPY the new data into it, and COMMIT. PostgreSQL has an optimisation where COPYing into a table that was TRUNCATEd in the same transaction does much less I/O, provided you're running with wal_level = minimal (the default up to PostgreSQL 9.6; from version 10 onwards the default is replica, so you'd need to set it explicitly and also set max_wal_senders = 0).
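
A refresh might then look like this sketch (the table name and file path are placeholders, and the WAL-skipping optimisation only applies when wal_level = minimal):

    BEGIN;
    TRUNCATE dataset_1234;
    -- COPY into a table truncated earlier in the same transaction:
    -- with wal_level = minimal this can skip most WAL writes.
    COPY dataset_1234 FROM '/tmp/dataset_1234.csv' (FORMAT csv);
    COMMIT;

From an application you'd normally stream the data with COPY ... FROM STDIN (or psql's \copy) rather than reading a server-side file, which needs file-read privileges on the server.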

If you cannot partition and truncate (say, if you're dealing with tens or hundreds of thousands of data sets, where there'd just be too many tables) you'll instead want to crank autovacuum up to run as much as it can, make sure you have good indexes on anything you delete based on, and be prepared for somewhat ordinary performance.
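
As a rough illustration, autovacuum can be made far more aggressive on the affected tables through per-table storage parameters (the table name and the numbers below are assumptions to adjust for your own churn rate):

    ALTER TABLE volatile_data SET (
        autovacuum_vacuum_scale_factor = 0.0,  -- trigger on absolute row counts, not table size
        autovacuum_vacuum_threshold    = 500,  -- vacuum once ~500 rows are dead
        autovacuum_vacuum_cost_delay   = 0     -- don't throttle the autovacuum worker
    );

Cluster-wide, lowering autovacuum_naptime (default 1min) in postgresql.conf also matters, since once-per-second churn will otherwise outpace the workers.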

If you don't need crash safety - you don't mind your tables being empty after a system crash - you can also create your tables as UNLOGGED, which will save you a huge amount of I/O cost.
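
A sketch, again with placeholder names (an unlogged variant of the earlier per-dataset table):

    -- New tables can be created unlogged from the start...
    CREATE UNLOGGED TABLE dataset_5678 (LIKE dataset_template INCLUDING ALL);

    -- ...or existing ones converted in place (PostgreSQL 9.5+); note this rewrites the table.
    ALTER TABLE dataset_1234 SET UNLOGGED;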

If you don't mind having to restore the whole setup from a backup after a system crash you can go a step further and also set fsync=off, which basically says to PostgreSQL "don't bother with crash safety, I have good backups and I don't care if my data is permanently and totally unrecoverable after a crash, and I'm happy to re-initdb before I can use my database again".
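
If you do go down that road, here's a sketch of the relevant cluster-wide settings via ALTER SYSTEM (assumes superuser; everything here trades durability for speed):

    ALTER SYSTEM SET fsync = off;               -- data may be unrecoverable after a crash
    ALTER SYSTEM SET full_page_writes = off;    -- likewise only sensible if you can re-initdb
    ALTER SYSTEM SET synchronous_commit = off;  -- milder: risks losing recent commits, not corruption
    SELECT pg_reload_conf();                    -- these particular settings take effect on reload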

I wrote some more about this in a similar thread on Stack Overflow about optimising PostgreSQL for fast testing; that mentions host OS tuning, separating WAL onto a different disk if you're not using unlogged tables, checkpointer adjustments, etc.

There's also some info in the Pg docs for fast data loading and non-durable settings.