You write:

> Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.
Either way, it does not matter much here. Your `EXPLAIN` shows only 5026 rows for the site scan (5018 for the customer scan), so hardly any customer actually has more than one site. Did you `ANALYZE` your tables before running `EXPLAIN`?
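If not, run `ANALYZE` first so the planner works with current statistics. A minimal sketch, assuming your tables are simply named `customer` and `site`:

```sql
-- Refresh planner statistics before looking at the plan again:
ANALYZE customer;
ANALYZE site;
```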
From the numbers I see in your `EXPLAIN`, indexes will give you nothing for this query; sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?
Maybe the query itself is faster, but "half a second" includes network transfer? `EXPLAIN ANALYZE` would tell us more.
If this query is your bottleneck, I would suggest you implement a materialized view.
After you provided more information, I find that my diagnosis pretty much holds. The query itself needs 27 ms, so there is not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected: the slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should retrieve only 100 rows, which would solve most of it, even if it means executing the whole query every time.
If you go the route with a materialized view as I proposed, you could add a gap-free serial number plus an index on it by adding a column `row_number() OVER (<your sort criteria here>) AS mv_id`.
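A minimal sketch of such a view (Postgres 9.3+ for `CREATE MATERIALIZED VIEW`); the base table `customer` and the sort criteria are assumptions:

```sql
CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY c.name, c.id) AS mv_id  -- gap-free serial number
     , c.*
FROM   customer c;

-- Index to support fast range scans on mv_id:
CREATE UNIQUE INDEX ON materialized_view (mv_id);
```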
Then you can query:
```sql
SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;
```
This will perform very fast. `LIMIT` / `OFFSET` cannot compete; that approach has to compute and sort the whole result before it can skip the offset and pick 100 rows.
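For contrast, a hypothetical `OFFSET` variant on the base table (sort criteria assumed, as above):

```sql
-- Has to produce and sort all ~5000 rows before it can skip 2700 and return 100:
SELECT *
FROM   customer
ORDER  BY name, id
OFFSET 2700
LIMIT  100;
```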
pgAdmin timing
When you execute a query from the query tool, the message pane shows something like:
Total query runtime: 62 ms.
And the status line shows the same time. I quote pgAdmin help about that:
> The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.
If you want to see the time on the server, you need to use SQL `EXPLAIN ANALYZE`, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:
Total runtime: 0.269 ms
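For example, to measure only the server-side execution time of a query (the table name is a placeholder):

```sql
-- Executes the query and reports timing per plan node, without sending the result set:
EXPLAIN ANALYZE
SELECT * FROM customer;
```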
PostgreSQL certainly can use an index for `IS NOT NULL`, and I don't see any special query planner assumptions about that condition, either. If the null fraction for the column (`pg_statistic.stanullfrac`) is high enough to suggest that the index is usefully selective for the query, PostgreSQL will use an index.
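You can inspect that statistic yourself via the friendlier `pg_stats` view; the table and column names here are placeholders:

```sql
-- null_frac is the planner's estimated fraction of NULLs in the column:
SELECT null_frac
FROM   pg_stats
WHERE  tablename = 'customer'
AND    attname   = 'email';
```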
I can't figure out what you're trying to say with:

> If this is correct, is my understanding that an index on a column defined as "NOT NULL" not be used in a query which uses that column?
Certainly an index won't get used for an `IS NOT NULL` condition on a `NOT NULL` column. It'd always match 100% of rows, so a seqscan will almost always be much faster.
PostgreSQL won't use an index if the index doesn't filter out a large proportion of rows for a query. The only likely exception is when you're asking for a set of columns covered by a single index, in an order matching that of the index; PostgreSQL might do an index-only scan then. E.g. if there's an index on `t(a, b, c)` and you:

```sql
SELECT a, b FROM t ORDER BY a, b, c;
```

PostgreSQL might use your index, even though no rows are filtered out, because it only has to read the index and can skip reading the heap, avoid doing a sort, etc.
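A self-contained sketch of that case (all names illustrative):

```sql
CREATE TABLE t (a int, b int, c int);
CREATE INDEX t_a_b_c_idx ON t (a, b, c);
-- ... load data, then VACUUM t so the visibility map permits index-only scans ...
EXPLAIN SELECT a, b FROM t ORDER BY a, b, c;
-- The plan may then show: Index Only Scan using t_a_b_c_idx on t
```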
Best Answer
If additional preconditions for an index-only scan are met, it makes perfect sense to append the column `id` as trailing column to the index (not as leading column).

Postgres 11 introduces actual covering indexes with the `INCLUDE` keyword. Only a small benefit for your case, but it's a great option to add columns to a UNIQUE or PK index or constraint.
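Both variants as a sketch; the existing key column `type_name` on `thing_types` is an assumption for illustration:

```sql
-- Any Postgres version: append id as trailing index column:
CREATE INDEX thing_types_type_name_id_idx ON thing_types (type_name, id);

-- Postgres 11+: keep only type_name in the index key, carry id as payload:
CREATE UNIQUE INDEX thing_types_type_name_incl_id_idx
    ON thing_types (type_name) INCLUDE (id);
```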
About index-only scans:
- The most important precondition: the visibility map of table `thing_types` has to show most or all pages as "visible" to all transactions. I.e., the table is either read-only, or your autovacuum settings are aggressive enough to continuously clean up after writes to the table.
- Every additional index adds costs, mostly to write performance, but also side effects like exhausted cache capacities. (Multiple queries using the same indexes have a better chance for them to reside in cache.) So it's also a question of size. `id` is typically a very small column, `integer` or `bigint`, which makes it a good candidate for the use case.
- In particular, adding a column to an index disables the option for H.O.T. updates involving that column. But since `id` is indexed anyway and typically not updated (being the PK), this is not a problem in this case.
- If you actually get index-only scans out of these indexes most of the time, it typically makes sense to use them. Test with `EXPLAIN`.
- There were limitations for partial indexes in older versions. Quoting the release notes of Postgres 9.6: