PostgreSQL upsert implications on read performance after bulk load

postgresql

In PostgreSQL documentation on populating databases, they mention how we can improve bulk loading operations by disabling constraints and indexes, but seeing how INSERT … ON CONFLICT (upsert) requires the use of a unique constraint to work, and that perfectly makes sense, I've been wondering how good the following strategy is:

Create multiple connections to load data in batches in parallel into a staging unlogged table
Create deferred constraints to do integrity validation on the table's data
Upsert the staging data into a big target table (5GB+), which also would need to have a PK, unique index or unique constraint to make the upsert possible.

The insert on conflict approach is known to be better than manualing creating SQL functions or scripts to join between the two tables to discover the "new records" to insert and "common records" to be updated. That's the whole point of why INSERT … ON CONFLICT exists.

Yet I'm still wondering the performance implications for reading the data after loading it.

Questions:

In terms of index bloat, an upsert would be worse than simple bulk copying inserts + update to a table without constraints? Is upsert bad for index maintenance?
If that is the case, I believe faster loads would also imply bad query performance after the table is loaded. Should I recreate the indexes after INSERT … ON CONFLICT?
Compared to INSERT … ON CONFLICT, the function/script way of doing it, while less performant for loads, would be better for index maitenance since the target table doesn't need to have an index to do the upserting?

Best Answer

That's the whole point of why INSERT ... ON CONFLICT exists.

I don't think that that is the case. I think a pretty big point was a convenience so people don't have to implement a catch and retry loop for unique constraint violations upon single-row insertions. Not rolling back entire bulk inserts due to a few violations is also a pretty good improvement, but I don't think it is the whole point. And if you are going to have the table offline to other uses anyway, think I think it is actually none of the point of ON CONFLICT.

ON CONFLICT pays a high price to handle concurrency at a fine grained level. If you can guarantee that only this procedure is inserting new records, then one query to do the insertion of new keys and a second query to do the update of olds ones would almost certainly be more performant than using ON CONFLICT. If your point 3 means that the big table had all constraints and indexes dropped other than the one needed to support the ON CONFLICT, then it seems likely you are in a maintenance window, and so you can probably make the guarantee about no concurrent inserts.

In terms of index bloat, an upsert would be worse than simple bulk copying inserts + update to a table without constraints? Is upsert bad for index maintenance?

I would not expect this to be the case in general, if you are keeping the unique constraint either way. But it would depend on things like whether the unique key values of rows being inserted are mostly occurring in order, or random. Also, bloat would depend on whether rows taking the UPDATE path have room for the updated row in the same block the old version originally resided in (so the table fill-factor), but this would also be true if the UPDATE were a separate statement. You would have to mock up something similar to your actual conditions and test it to get clear answers.

Compared to INSERT ... ON CONFLICT, the function/script way of doing it, while less performant for loads, would be better for index maintenance since the target table doesn't need to have an index to do the upserting?

If you take the table offline for normal use and drop all constraints and indexes, then it should be both more performant and lead to less bloated indexes (although perhaps only slightly less bloated) once you rebuild them. That is the case whether you compare it to either using ON CONFLICT online, or compare it to dropping all-but-one of the constraints and running ON CONFLICT offline.

Assumptions

Since information is missing in the Q, I'll assume:

Your data comes from a file on the database server.
The data is formatted just like COPY output, with a unique id per row to match the the target table.
If not, format it properly first or use COPY options to deal with the format.
You are updating every single row in the target table or most of them.
You can afford to drop and recreate the target table.
That means no concurrent access. Else consider this related answer:
- Best way to populate a new column in a large table?
There are no depending objects at all, except for indices.

Solution

I suggest you go with a similar approach as outlined at the link from your third bullet. With major optimizations.

To create the temporary table, there is a simpler and faster way:

CREATE TEMP TABLE tmp_tbl AS SELECT * FROM tbl LIMIT 0;

A single big UPDATE from a temporary table inside the database will be faster than individual updates from outside the database by several orders of magnitude.

In PostgreSQL's MVCC model, an UPDATE means to create a new row version and mark the old one as deleted. That's about as expensive as an INSERT and a DELETE combined. Plus, it leaves you with a lot of dead tuples. Since you are updating the whole table anyway, it would be faster overall to just create a new table and drop the old one.

If you have enough RAM available, set temp_buffers (only for this session!) high enough to hold the temp table in RAM - before you do anything else.

To get an estimate how much RAM is needed, run a test with a small sample and use db object size functions:

SELECT pg_size_pretty(pg_relation_size('tmp_tbl'));  -- complete size of table
SELECT pg_column_size(t) FROM tmp_tbl t LIMIT 10;  -- size of sample rows

Complete script

SET temp_buffers = '1GB';        -- example value

CREATE TEMP TABLE tmp_tbl AS SELECT * FROM tbl LIMIT 0;

COPY tmp_tbl FROM '/absolute/path/to/file';

CREATE TABLE tbl_new AS
SELECT t.col1, t.col2, u.field1, u.field2
FROM   tbl     t
JOIN   tmp_tbl u USING (id);

-- Create indexes like in original table
ALTER TABLE tbl_new ADD PRIMARY KEY ...;
CREATE INDEX ... ON tbl_new (...);
CREATE INDEX ... ON tbl_new (...);

-- exclusive lock on tbl for a very brief time window!
DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;

DROP TABLE tmp_tbl; -- will also be dropped at end of session automatically

Concurrent load

Concurrent operations on the table (which I ruled out in the assumptions at the start) will wait, once the table is locked near the end and fail as soon as the transaction is committed, because the table name is resolved to its OID immediately, but the new table has a different OID. The table stays consistent, but concurrent operations may get an exception and have to be repeated. Details in this related answer:

Best way to populate a new column in a large table?

UPDATE route

If you (have to) go the UPDATE route, drop any index that is not needed during the update and recreate it afterwards. It is much cheaper to create an index in one piece than to update it for every individual row. This may also allow for HOT updates.

I outlined a similar procedure using UPDATE in this closely related answer on SO.

PostgreSQL UPSERT issue with NULL values

Clarify `ON CONFLICT DO UPDATE` behavior

Consider the manual here:

For each individual row proposed for insertion, either the insertion proceeds, or, if an arbiter constraint or index specified by conflict_target is violated, the alternative conflict_action is taken.

Bold emphasis mine. So you do not have to repeat predicates for columns included in the unique index in the WHERE clause to the UPDATE (the conflict_action):

INSERT INTO test_upsert AS tu
       (name   , status, test_field  , identifier, count) 
VALUES ('shaun', 1     , 'test value', 'ident'   , 1)
ON CONFLICT (name, status, test_field) DO UPDATE
SET count = tu.count + 1;
WHERE tu.name = 'shaun' AND tu.status = 1 AND tu.test_field = 'test value'

The unique violation already establishes what your added WHERE clause would enforce redundantly.

Clarify partial index

Add a WHERE clause to make it an actual partial index like you mentioned yourself (but with inverted logic):

CREATE UNIQUE INDEX test_upsert_partial_idx
ON public.test_upsert (name, status)
WHERE test_field IS NULL;  -- not: "is not null"

To use this partial index in your UPSERT you need a matching conflict_target like @ypercube demonstrates:

ON CONFLICT (name, status) WHERE test_field IS NULL

Now the above partial index is inferred. However, as the manual also notes:

[...] a non-partial unique index (a unique index without a predicate) will be inferred (and thus used by ON CONFLICT) if such an index satisfying every other criteria is available.

If you have an additional (or only) index on just (name, status) it will (also) be used. An index on (name, status, test_field) would explicitly not be inferred. This doesn't explain your problem, but may have added to the confusion while testing.

Solution

AIUI, none of the above solves your problem, yet. With the partial index, only special cases with matching NULL values would be caught. And other duplicate rows would either be inserted if you have no other matching unique indexes / constraints, or raise an exception if you do. I suppose that's not what you want. You write:

The composite key is made up of 20 columns, 10 of which can be nullable.

What exactly do you consider a duplicate? Postgres (according to the SQL standard) does not consider two NULL values to be equal. The manual:

In general, a unique constraint is violated if there is more than one row in the table where the values of all of the columns included in the constraint are equal. However, two null values are never considered equal in this comparison. That means even in the presence of a unique constraint it is possible to store duplicate rows that contain a null value in at least one of the constrained columns. This behavior conforms to the SQL standard, but we have heard that other SQL databases might not follow this rule. So be careful when developing applications that are intended to be portable.

Allow null in unique column

I assume you want NULL values in all 10 nullable columns to be considered equal. It is elegant & practical to cover a single nullable column with an additional partial index like demonstrated here:

PostgreSQL multi-column unique constraint and NULL values

But this gets out of hand quickly for more nullable columns. You'd need a partial index for every distinct combination of nullable columns. For just 2 of those that's 3 partial indexes for (a), (b) and (a,b). The number is growing exponentially with 2^n - 1. For your 10 nullable columns, to cover all possible combinations of NULL values, you'd already need 1023 partial indexes. No go.

The simple solution: replace NULL values and define involved columns NOT NULL, and everything would work just fine with a simple UNIQUE constraint.

If that's not an option I suggest an expression index with COALESCE to replace NULL in the index:

CREATE UNIQUE INDEX test_upsert_solution_idx
    ON test_upsert (name, status, COALESCE(test_field, ''));

The empty string ('') is an obvious candidate for character types, but you can use any legal value that either never appears or can be folded with NULL according to your definition of "unique".

Then use this statement:

INSERT INTO test_upsert as tu(name,status,test_field,identifier, count) 
VALUES ('shaun', 1, null        , 'ident', 11)  -- works with
     , ('bob'  , 2, 'test value', 'ident', 22)  -- and without NULL
ON     CONFLICT (name, status, COALESCE(test_field, '')) DO UPDATE  -- match expr. index
SET    count = COALESCE(tu.count + EXCLUDED.count, EXCLUDED.count, tu.count);

Like @ypercube I assume you actually want to add count to the existing count. Since the column can be NULL, adding NULL would set the column NULL. If you define count NOT NULL, you can simplify.

Another idea would be to just drop the conflict_target from the statement to cover all unique violations. Then you could define various unique indexes for a more sophisticated definition of what's supposed to be "unique". But that won't fly with ON CONFLICT DO UPDATE. The manual once more:

For ON CONFLICT DO NOTHING, it is optional to specify a conflict_target; when omitted, conflicts with all usable constraints (and unique indexes) are handled. For ON CONFLICT DO UPDATE, a conflict_target must be provided.

Best Answer

Related Solutions

Postgresql – Optimizing bulk update performance in PostgreSQL

Assumptions

Solution

Complete script

Concurrent load

UPDATE route

PostgreSQL UPSERT issue with NULL values

Clarify ON CONFLICT DO UPDATE behavior

Clarify partial index

Solution

Related Question

Clarify `ON CONFLICT DO UPDATE` behavior