Postgresql – How to index when joining on a table with conditions

indexjoin;postgresqlselect

I have a query like this:

SELECT "main_table".*
FROM "main_table"
INNER JOIN "other_table" ON "other_table"."deleted_at" IS NULL
AND "other_table"."id" = "main_table"."appointment_id"
WHERE "main_table"."user_id" = 1
  AND (other_table.date BETWEEN '2020-02-26 23:29:42.678693' AND '2020-02-27 01:29:42.678739')

You probably know everything you need to about the schema from the query, but let me know any questions.

What are the optimal indexes to make for this?

main_table is easy (right?)

CREATE INDEX ON main_table (user_id);

For other_table, I'm tempted to think "first postgres will 'do the join', then it will 'filter by date'". Which suggests this index:

CREATE INDEX ON other_table (id, date) where deleted_at IS NULL;

But, without creating the indexes and running EXPLAIN, is it unknowable how postgres will go about doing the query, and it might think that it's better to filter by date first, then 'do the join'?

  • this select will typically match 1–10 rows
  • for a given date range, there are thousands of rows in other_table
  • for a given user_id there are thousands of rows in main_table
  • the results of this query will never have two main_table rows correlating with one other_table row, or vice versa. It's always 1-1.
  • other_table has about 2/3 of its rows with deleted_at not null

Best Answer

Given your information about the distribution of the data, the following execution plan will be the best:

  • Use an index scan to get the 1 to 10 rows from other table.

  • Perform a nested loop join with main_table as inner table, which will be fast if the join condition is indexed.

So the ideal index on other_table would be

CREATE INDEX ON other_table (date) WHERE deleted_at IS NULL;

If deleted_at is usually NULL, you can omit the WHERE clause without losing much.

If you want to get an index only scan on other_table (which probably isn't necessary, since you only fetch very few rows), you could instead

CREATE INDEX ON other_table (date) INCLUDING (id) WHERE deleted_at IS NULL;

The ideal index on main_table would be

CREATE INDEX ON main_table (user_id, appointment_id);

The order of the columns doesn't matter in this case, because you will scan for both columns with the = operator.


In the real world, you will try to pick indexes that can be useful for as many queries as possible, because having too many indexes hurts performance and wastes space.