Postgresql – Postgres Upsert Performance Considerations (millions of rows/hour)

performancepostgresqlupsert

I have tried doing some research for quite a while on the performance implications of the built-in upsert (INSERT … ON CONFLICT UPDATE…) when a lot of upserts are made. The scale is something like 1 million records/hour, that is split up in groups of around 300 records each.

The table is fairly simple, just a few int/bool fields and a date field. The total size of the table is somewhere between 50-60 million records. Essentially, all the rows are supposed to stay up-to-date within a certain cycle (that is currently about 2 days). Sometimes a lot of information changes, sometimes very little/no information changes (e.g. 300 records get upserted but they are identical to the existing records).

So far, one hypothesis is that this project seems to be suffering from the large amount of writes that happen constantly since even if the upsert results in no inserts/updates, the "failed" inserts from the upsert will still get written somewhere (according to our knowledge). Therefore, the idea is to utilize old-fashioned upserts (writeable CTEs) and do more granular operations that can make sure to only insert data that doesn't already exist, and only update data that has actually changed. Naturally, however, this will put more read-load on the DB and increase query complexity.

The question is, are we right in our assumptions that the built-in upsert is useless from a performance perspective (e.g. it's only good for much smaller tasks) or are we wrong? I read a bit about HOT updates and autovacuum tuning, but nothing that references something more similar to this question.

Worth mentioning is that this DB (PostgreSQL 10.9) is running on Heroku so we are not able to tune it to our needs. We are planning to move at some point, depending on how important it ends up being. Finally, it is also worth mentioning that this DB also has one follower (i.e. read replica).

Would really appreciate some good insight on this! Thanks beforehand.

Best Answer

INSERT ... ON CONFLICT is the fastest way to upsert data. If it is fast enough for you can only be seen from a performance test.

It matters a lot if you use ON CONFLICT DO NOTHING or if you use an UPDATE clause:

With DO NOTHING, there won't be any dead tuples that have to be cleaned up. If you perform an UPDATE, there will be a dead tuple, and cleaning up these dead tuples may turn out to be the bottleneck in your processing.

Use a fillfactor of 70 or so and make sure that none of the updated columns are indexed. Then you can get HOT updates which may save your day.