Postgresql – How to index when joining on a table with conditions

indexjoin;postgresqlselect

I have a query like this:

SELECT "main_table".*
FROM "main_table"
INNER JOIN "other_table" ON "other_table"."deleted_at" IS NULL
AND "other_table"."id" = "main_table"."appointment_id"
WHERE "main_table"."user_id" = 1
  AND (other_table.date BETWEEN '2020-02-26 23:29:42.678693' AND '2020-02-27 01:29:42.678739')

You probably know everything you need to about the schema from the query, but let me know any questions.

What are the optimal indexes to make for this?

main_table is easy (right?)

CREATE INDEX ON main_table (user_id);

For other_table, I'm tempted to think "first postgres will 'do the join', then it will 'filter by date'". Which suggests this index:

CREATE INDEX ON other_table (id, date) where deleted_at IS NULL;

But, without creating the indexes and running EXPLAIN, is it unknowable how postgres will go about doing the query, and it might think that it's better to filter by date first, then 'do the join'?

this select will typically match 1–10 rows
for a given date range, there are thousands of rows in other_table
for a given user_id there are thousands of rows in main_table
the results of this query will never have two main_table rows correlating with one other_table row, or vice versa. It's always 1-1.
other_table has about 2/3 of its rows with deleted_at not null

Best Answer

Given your information about the distribution of the data, the following execution plan will be the best:

Use an index scan to get the 1 to 10 rows from other table.
Perform a nested loop join with main_table as inner table, which will be fast if the join condition is indexed.

So the ideal index on other_table would be

CREATE INDEX ON other_table (date) WHERE deleted_at IS NULL;

If deleted_at is usually NULL, you can omit the WHERE clause without losing much.

If you want to get an index only scan on other_table (which probably isn't necessary, since you only fetch very few rows), you could instead

CREATE INDEX ON other_table (date) INCLUDING (id) WHERE deleted_at IS NULL;

The ideal index on main_table would be

CREATE INDEX ON main_table (user_id, appointment_id);

The order of the columns doesn't matter in this case, because you will scan for both columns with the = operator.

In the real world, you will try to pick indexes that can be useful for as many queries as possible, because having too many indexes hurts performance and wastes space.

Addition by @Erwin

^{As per request in comment}

In principal (default) b-tree indexes can be scanned forward and backward at the same speed. But sorting can make a difference in multi-column indexes where you combine the sort order of multiple columns. The query starts with:

SELECT  DISTINCT ON (songs.rank, songs.shared_id)

In combination with ORDER BY rank DESC this dictates that the result be ordered by rank DESC, shared_id effectively. After the (simplified) WHERE clause WHERE processed AND working has been applied and before LIMIT can be applied.
I have my doubts if the DISTINCT clause is actually useful. But while it is there, the optimal index for the query should be (just as @ypercube suspected):

CREATE INDEX songs_special_idx
ON songs (processed, working, rank DESC, shared_id);

Looks like one of the rare cases where explicit ordering of index columns would benefit the query. There is an excellent explanation in the chapter Indexes and ORDER BY of the manual.

If the WHERE condition is stable (always WHERE processed AND working), a partial multi-column index would be smaller and faster, yet:

CREATE INDEX songs_special_idx
ON songs (rank DESC, shared_id)
WHERE processed AND working;

PostgreSQL Index Tuning – Compound Index on Large Tables

In a GiST index, the order of columns has a different significance than in a B-tree index. Per documentation:

A multicolumn GiST index can be used with query conditions that involve any subset of the index's columns. Conditions on additional columns restrict the entries returned by the index, but the condition on the first column is the most important one for determining how much of the index needs to be scanned. A GiST index will be relatively ineffective if its first column has only a few distinct values, even if there are many distinct values in additional columns.

In short: put the most selective columns first.

Your EXPLAIN output shows that the condition on pid is more selective (rows=7836) than the one on outline (rows=63112). If that can be generalized (a single example may be misleading) I suggest this alternative:

CREATE INDEX inventory_compound_idx ON portal.inventory USING gist (pid, outline);

If most of your (important) queries include conditions on both columns, a multicolumn index may serve you well. Else, single columns may be better overall.

Table layout

This is an educated guess since I don't have complete information.

Don't use oid as column name. It's easy to confuse with the OID.
Don't use the name date for a timestamp column. Or rather: don't use the name date for any column, don't use names of base-types for identifiers at all. Can lead to confusing mistakes and error messages.
Create a lookup table for types and only put a small integer type_id into the big table. Pack fixed-length types tightly so not to waste space to padding. Details.
I prefer the type text (or varchar without length limit) over varchar(n). Details.

For example:

CREATE TABLE portal.inventory (
   inventory_id bigint PRIMARY KEY
  ,type_id      integer NOT NULL REFERENCES inv_type(type_id)
  ,pid          integer NOT NULL
  ,size         bigint NOT NULL
  ,ts           timestamp NOT NULL
  ,outline      geography(Polygon,4326)
  ,product_name text
  ,path         text
);

Best Answer

Related Solutions

Postgresql – Really slow DISTINCT ON query with multiple joins

Addition by @Erwin

PostgreSQL Index Tuning – Compound Index on Large Tables

Table layout

Related Question