You write:

> Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.
Either way, it does not matter much here. Your `EXPLAIN` shows only 5026 rows for the site scan (5018 for the customer scan), so hardly any customer actually has more than one site. Did you `ANALYZE` your tables before running `EXPLAIN`?
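If not, run `ANALYZE` first so the planner works with current statistics. A minimal sketch, assuming your tables are simply named `customer` and `site`:

```sql
-- Refresh planner statistics before looking at the plan again:
ANALYZE customer;
ANALYZE site;
```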
From the numbers I see in your `EXPLAIN`, indexes will give you nothing for this query; sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?
Maybe the query itself is faster, but "half a second" includes network transfer? `EXPLAIN ANALYZE` would tell us more.
If this query is your bottleneck, I would suggest you implement a materialized view.
After you provided more information, I find that my diagnosis pretty much holds. The query itself needs 27 ms, so there is not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected: the slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should retrieve only 100 rows, which would solve most of it, even if it means executing the whole query every time.
If you go the route with a materialized view as I proposed, you could add a gap-free serial number plus an index on it by adding a column `row_number() OVER (<your sort criteria here>) AS mv_id`.
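A minimal sketch of such a view (Postgres 9.3+ for `CREATE MATERIALIZED VIEW`); the base table `customer` and the sort criteria are assumptions:

```sql
CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY c.name, c.id) AS mv_id  -- gap-free serial number
     , c.*
FROM   customer c;

-- Index to support fast range scans on mv_id:
CREATE UNIQUE INDEX ON materialized_view (mv_id);
```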
Then you can query:
```sql
SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;
```
This will perform very fast. `LIMIT` / `OFFSET` cannot compete; that approach has to compute and sort the whole result before it can skip the offset and pick 100 rows.
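For contrast, a hypothetical `OFFSET` variant on the base table (sort criteria assumed, as above):

```sql
-- Has to produce and sort all ~5000 rows before it can skip 2700 and return 100:
SELECT *
FROM   customer
ORDER  BY name, id
OFFSET 2700
LIMIT  100;
```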
pgAdmin timing
When you execute a query from the query tool, the message pane shows something like:
Total query runtime: 62 ms.
And the status line shows the same time. I quote pgAdmin help about that:
> The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.
If you want to see the time on the server, you need to use SQL `EXPLAIN ANALYZE`, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:
Total runtime: 0.269 ms
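For example, to measure only the server-side execution time of a query (the table name is a placeholder):

```sql
-- Executes the query and reports timing per plan node, without sending the result set:
EXPLAIN ANALYZE
SELECT * FROM customer;
```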
PostgreSQL certainly can use an index for `IS NOT NULL`, and I don't see any special query planner assumptions about that condition, either. If the null fraction for the column (`pg_statistic.stanullfrac`) is high enough to suggest that the index is usefully selective for the query, PostgreSQL will use an index.
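You can inspect that statistic yourself via the friendlier `pg_stats` view; the table and column names here are placeholders:

```sql
-- null_frac is the planner's estimated fraction of NULLs in the column:
SELECT null_frac
FROM   pg_stats
WHERE  tablename = 'customer'
AND    attname   = 'email';
```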
I can't figure out what you're trying to say with:

> If this is correct, is my understanding that an index on a column defined as "NOT NULL" not be used in a query which uses that column?
Certainly an index won't get used for an `IS NOT NULL` condition on a `NOT NULL` column. It'd always match 100% of rows, so a seqscan will almost always be much faster.
PostgreSQL won't use an index if the index doesn't filter out a large proportion of rows for a query. The only likely exception is when you're asking for a set of columns covered by a single index, in an order matching that of the index; PostgreSQL might do an index-only scan then. E.g. if there's an index on `t(a, b, c)` and you:

```sql
SELECT a, b FROM t ORDER BY a, b, c;
```

PostgreSQL might use your index, even though no rows are filtered out, because it only has to read the index and can skip reading the heap, avoid doing a sort, etc.
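A self-contained sketch of that case (all names illustrative):

```sql
CREATE TABLE t (a int, b int, c int);
CREATE INDEX t_a_b_c_idx ON t (a, b, c);
-- ... load data, then VACUUM t so the visibility map permits index-only scans ...
EXPLAIN SELECT a, b FROM t ORDER BY a, b, c;
-- The plan may then show: Index Only Scan using t_a_b_c_idx on t
```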
Best Answer
If additional preconditions for an index-only scan are met, it makes perfect sense to append the column `id` as trailing column to the index (not as leading column).

Postgres 11 introduces actual covering indexes with the `INCLUDE` keyword. Only a small benefit for your case, but it's a great option to add columns to a UNIQUE or PK index or constraint.
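Both variants as a sketch; the existing key column `type_name` on `thing_types` is an assumption for illustration:

```sql
-- Any Postgres version: append id as trailing index column:
CREATE INDEX thing_types_type_name_id_idx ON thing_types (type_name, id);

-- Postgres 11+: keep only type_name in the index key, carry id as payload:
CREATE UNIQUE INDEX thing_types_type_name_incl_id_idx
    ON thing_types (type_name) INCLUDE (id);
```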
About index-only scans:
- The most important precondition: the visibility map of table `thing_types` has to show most or all pages as "visible" to all transactions. I.e., the table is either read-only, or your autovacuum settings are aggressive enough to continuously clean up after writes to the table.
- Every additional index adds costs, mostly to write performance, but also side effects like exhausted cache capacities. (Multiple queries using the same indexes have a better chance for them to reside in cache.) So it's also a question of size. `id` is typically a very small column, `integer` or `bigint`, which makes it a good candidate for the use case.
- In particular, adding a column to an index disables the option for H.O.T. updates involving that column. But since `id` is indexed anyway and typically not updated (being the PK), this is not a problem in this case.
- If you actually get index-only scans out of these indexes most of the time, it typically makes sense to use them. Test with `EXPLAIN`.
- There were limitations for partial indexes in older versions. Quoting the release notes of Postgres 9.6: