PostgreSQL upsert implications on read performance after bulk load

postgresql

In PostgreSQL documentation on populating databases, they mention how we can improve bulk loading operations by disabling constraints and indexes, but seeing how INSERT … ON CONFLICT (upsert) requires the use of a unique constraint to work, and that perfectly makes sense, I've been wondering how good the following strategy is:

  1. Create multiple connections to load data in batches in parallel into a staging unlogged table

  2. Create deferred constraints to do integrity validation on the table's data

  3. Upsert the staging data into a big target table (5GB+), which also would need to have a PK, unique index or unique constraint to make the upsert possible.

The insert on conflict approach is known to be better than manualing creating SQL functions or scripts to join between the two tables to discover the "new records" to insert and "common records" to be updated. That's the whole point of why INSERT … ON CONFLICT exists.

Yet I'm still wondering the performance implications for reading the data after loading it.

Questions:

  1. In terms of index bloat, an upsert would be worse than simple bulk copying inserts + update to a table without constraints? Is upsert bad for index maintenance?

  2. If that is the case, I believe faster loads would also imply bad query performance after the table is loaded. Should I recreate the indexes after INSERT … ON CONFLICT?

  3. Compared to INSERT … ON CONFLICT, the function/script way of doing it, while less performant for loads, would be better for index maitenance since the target table doesn't need to have an index to do the upserting?

Best Answer

That's the whole point of why INSERT ... ON CONFLICT exists.

I don't think that that is the case. I think a pretty big point was a convenience so people don't have to implement a catch and retry loop for unique constraint violations upon single-row insertions. Not rolling back entire bulk inserts due to a few violations is also a pretty good improvement, but I don't think it is the whole point. And if you are going to have the table offline to other uses anyway, think I think it is actually none of the point of ON CONFLICT.

ON CONFLICT pays a high price to handle concurrency at a fine grained level. If you can guarantee that only this procedure is inserting new records, then one query to do the insertion of new keys and a second query to do the update of olds ones would almost certainly be more performant than using ON CONFLICT. If your point 3 means that the big table had all constraints and indexes dropped other than the one needed to support the ON CONFLICT, then it seems likely you are in a maintenance window, and so you can probably make the guarantee about no concurrent inserts.

In terms of index bloat, an upsert would be worse than simple bulk copying inserts + update to a table without constraints? Is upsert bad for index maintenance?

I would not expect this to be the case in general, if you are keeping the unique constraint either way. But it would depend on things like whether the unique key values of rows being inserted are mostly occurring in order, or random. Also, bloat would depend on whether rows taking the UPDATE path have room for the updated row in the same block the old version originally resided in (so the table fill-factor), but this would also be true if the UPDATE were a separate statement. You would have to mock up something similar to your actual conditions and test it to get clear answers.

Compared to INSERT ... ON CONFLICT, the function/script way of doing it, while less performant for loads, would be better for index maintenance since the target table doesn't need to have an index to do the upserting?

If you take the table offline for normal use and drop all constraints and indexes, then it should be both more performant and lead to less bloated indexes (although perhaps only slightly less bloated) once you rebuild them. That is the case whether you compare it to either using ON CONFLICT online, or compare it to dropping all-but-one of the constraints and running ON CONFLICT offline.