You write:

Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.
Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?
From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?
Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.
If this query is your bottleneck, I would suggest you implement a materialized view.
After you provided more information, I find that my diagnosis pretty much holds. The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.
If you go the route with a materialized view like I proposed, you could add a gap-less serial number to the table, plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.
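A minimal sketch of that route; the underlying query and the sort column are placeholders for yours:

CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY c.name) AS mv_id  -- gap-less serial number
     , c.*
FROM   customer c;  -- stands in for your actual query

CREATE UNIQUE INDEX ON materialized_view (mv_id);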
Then you can query:
SELECT *
FROM materialized_view
WHERE mv_id >= 2700
AND mv_id < 2800;
This will perform very fast. LIMIT / OFFSET cannot compete, as that needs to compute the whole table before it can sort and pick 100 rows.
pgAdmin timing
When you execute a query from the query tool, the message pane shows something like:
Total query runtime: 62 ms.
And the status line shows the same time. I quote pgAdmin help about that:
The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.
If you want to see the time on the server, you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:
Total runtime: 0.269 ms
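In plain SQL, that's just a prefix to the statement; the table name here is made up for illustration:

EXPLAIN ANALYZE SELECT * FROM customer;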
Clarify ON CONFLICT DO UPDATE behavior
Consider the manual here:

For each individual row proposed for insertion, either the insertion proceeds, or, if an arbiter constraint or index specified by conflict_target is violated, the alternative conflict_action is taken.
Bold emphasis mine. So you do not have to repeat predicates for columns included in the unique index in the WHERE clause to the UPDATE (the conflict_action):
INSERT INTO test_upsert AS tu
       (name   , status, test_field  , identifier, count)
VALUES ('shaun', 1     , 'test value', 'ident'   , 1)
ON CONFLICT (name, status, test_field) DO UPDATE
SET    count = tu.count + 1;
-- WHERE tu.name = 'shaun' AND tu.status = 1 AND tu.test_field = 'test value'  -- redundant
The unique violation already establishes what your added WHERE clause would enforce redundantly.
Clarify partial index
Add a WHERE clause to make it an actual partial index like you mentioned yourself (but with inverted logic):
CREATE UNIQUE INDEX test_upsert_partial_idx
ON public.test_upsert (name, status)
WHERE test_field IS NULL; -- not: "is not null"
To use this partial index in your UPSERT you need a matching conflict_target like @ypercube demonstrates:
ON CONFLICT (name, status) WHERE test_field IS NULL
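For illustration, a complete statement with this conflict target might look like the following sketch; the values are made up:

INSERT INTO test_upsert AS tu (name, status, test_field, identifier, count)
VALUES ('shaun', 1, NULL, 'ident', 1)
ON CONFLICT (name, status) WHERE test_field IS NULL DO UPDATE
SET    count = tu.count + 1;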
Now the above partial index is inferred. However, as the manual also notes:

[...] a non-partial unique index (a unique index without a predicate) will be inferred (and thus used by ON CONFLICT) if such an index satisfying every other criteria is available.
If you have an additional (or only) index on just (name, status), it will (also) be used. An index on (name, status, test_field) would explicitly not be inferred. This doesn't explain your problem, but may have added to the confusion while testing.
Solution
AIUI, none of the above solves your problem, yet. With the partial index, only special cases with matching NULL values would be caught. And other duplicate rows would either be inserted if you have no other matching unique indexes / constraints, or raise an exception if you do. I suppose that's not what you want. You write:
The composite key is made up of 20 columns, 10 of which can be nullable.
What exactly do you consider a duplicate? Postgres (according to the SQL standard) does not consider two NULL values to be equal. The manual:
In general, a unique constraint is violated if there is more than one row in the table where the values of all of the columns included in the constraint are equal. However, two null values are never considered equal in this comparison. That means even in the presence of a unique constraint it is possible to store duplicate rows that contain a null value in at least one of the constrained columns. This behavior conforms to the SQL standard, but we have heard that other SQL databases might not follow this rule. So be careful when developing applications that are intended to be portable.
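A quick illustration of that behavior, using a throwaway table:

CREATE TABLE null_demo (a int, b int, UNIQUE (a, b));
INSERT INTO null_demo VALUES (1, NULL), (1, NULL);  -- both rows are accepted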
I assume you want NULL values in all 10 nullable columns to be considered equal. It is elegant & practical to cover a single nullable column with an additional partial index.
But this gets out of hand quickly for more nullable columns. You'd need a partial index for every distinct combination of nullable columns. For just 2 of those, that's 3 partial indexes, for (a), (b) and (a,b). The number grows exponentially as 2^n - 1. For your 10 nullable columns, covering all possible combinations of NULL values would already take 1023 partial indexes. No go.
The simple solution: replace NULL values and define the involved columns NOT NULL, and everything would work just fine with a simple UNIQUE constraint.
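A sketch of that route, assuming the empty string is an acceptable stand-in for NULL:

UPDATE test_upsert SET test_field = '' WHERE test_field IS NULL;  -- fold existing NULLs

ALTER TABLE test_upsert
  ALTER COLUMN test_field SET NOT NULL
, ALTER COLUMN test_field SET DEFAULT '';

ALTER TABLE test_upsert
  ADD CONSTRAINT test_upsert_uni UNIQUE (name, status, test_field);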
If that's not an option, I suggest an expression index with COALESCE to replace NULL in the index:
CREATE UNIQUE INDEX test_upsert_solution_idx
ON test_upsert (name, status, COALESCE(test_field, ''));
The empty string ('') is an obvious candidate for character types, but you can use any legal value that either never appears or can be folded with NULL according to your definition of "unique".
Then use this statement:
INSERT INTO test_upsert AS tu (name, status, test_field, identifier, count)
VALUES ('shaun', 1, NULL        , 'ident', 11)  -- works with
     , ('bob'  , 2, 'test value', 'ident', 22)  -- and without NULL
ON CONFLICT (name, status, COALESCE(test_field, '')) DO UPDATE  -- match expression index
SET    count = COALESCE(tu.count + EXCLUDED.count, EXCLUDED.count, tu.count);
Like @ypercube, I assume you actually want to add count to the existing count. Since the column can be NULL, adding NULL would set the column to NULL. If you define count NOT NULL, you can simplify.
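With count NOT NULL, the assignment could shrink to just:

SET    count = tu.count + EXCLUDED.count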
Another idea would be to just drop the conflict_target from the statement to cover all unique violations. Then you could define various unique indexes for a more sophisticated definition of what's supposed to be "unique". But that won't fly with ON CONFLICT DO UPDATE. The manual once more:
For ON CONFLICT DO NOTHING, it is optional to specify a conflict_target; when omitted, conflicts with all usable constraints (and unique indexes) are handled. For ON CONFLICT DO UPDATE, a conflict_target must be provided.
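So a catch-all sketch would have to use DO NOTHING, along these lines:

INSERT INTO test_upsert (name, status, test_field, identifier, count)
VALUES ('shaun', 1, NULL, 'ident', 1)
ON CONFLICT DO NOTHING;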
Best Answer
A corrupted index would be the prime suspect here. Test with:
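SELECT * FROM vendor            -- hypothetical table name, stands in for yours
WHERE  name || '' = 'lost row'; -- value stands in for the entry you cannot find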
The expression name || '' cannot use any indexes, so you get a sequential scan. If that query finds your entry, you have your diagnosis: corrupted index. Probably the one on just (name), but since multiple indexes qualify, recheck with EXPLAIN. (It's probably not the PK, as that one still throws a unique violation in your test, but that one may be corrupted, too ...)

There is a note about this in the release notes for Postgres 11.11 in particular.
Or maybe locales have been updated in your underlying OS? Same fix: reindex.
There are other reasons for index corruption, but the only other common one is a hardware issue. That should trigger more drastic measures immediately, starting with a backup.
Recreate affected indexes. You can use REINDEX.
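For example, with a hypothetical index name:

REINDEX INDEX vendor_name_idx;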
If you need to allow concurrent access to the table, use the non-blocking (but slower) CONCURRENTLY (Postgres 12 or later).
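Again with a hypothetical name:

REINDEX INDEX CONCURRENTLY vendor_name_idx;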
If there is reason to believe the problem might be systemic, recreate all indexes on the table.
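For instance, with a hypothetical table name:

REINDEX TABLE vendor;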
Or the whole database.
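With a hypothetical database name:

REINDEX DATABASE mydb;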
In case of a bigger cleanup, I would advise a maintenance window without concurrent access. And plenty of maintenance_work_mem.
Aside, you have these indexes for the two columns name and processed_name:
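Presumably something along these lines; the table and index names are guesses based on the description:

ALTER TABLE vendor ADD PRIMARY KEY (name, processed_name);
CREATE INDEX vendor_name_idx ON vendor (name);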
The PK index on (name, processed_name) can be used for everything that the additional index on just (name) can be used for. That additional index might only be useful if processed_name is a rather large column, in which case I would consider a more efficient PK to begin with.