I have two tables – listings and users:
CREATE TABLE users (
    id text PRIMARY KEY,
    internal_id serial
);

CREATE TABLE listings (
    id bigserial PRIMARY KEY,
    created_at timestamptz DEFAULT current_timestamp,
    posted_by text REFERENCES users ON UPDATE CASCADE ON DELETE CASCADE
);
When I select a listing, I want the lookup to use listings.id first, because there is a B-tree index on it and that gives the fastest lookup. Then I want to filter by posted_by. In addition, I join the users table so I get both the listing record and users.internal_id for the user referenced by posted_by.
This query works:
SELECT *, users.internal_id
FROM listings
JOIN users ON listings.posted_by = users.id
WHERE listings.id = 2
  AND posted_by = '1009280753550922';
but I do not think it is using the index on listings.id, which would give the best performance:
dev_dolphin_db=# explain SELECT *, users.internal_id FROM listings JOIN users ON listings.posted_by = users.id WHERE listings.id = 2 and posted_by = '1009280753550922';
Nested Loop (cost=0.14..10.46 rows=1 width=1560)
-> Seq Scan on listings (cost=0.00..2.29 rows=1 width=840)
Filter: ((posted_by = '1009280753550922'::text) AND (id = 2))
-> Index Scan using users_pkey on users (cost=0.14..8.16 rows=1 width=720)
Index Cond: (id = '1009280753550922'::text)
How could I build a query that explicitly uses the index on listings.id first?
Best Answer
If you look carefully at your query plan, you will find that the Seq Scan expects to find just one row (rows=1 width=840) in the listings table. This is what happens if you have a situation such as this:
We add a few users, but just ONE listing:
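Something along these lines, reusing the table definitions from the question (the fiddle's exact data is not shown, so the values here are illustrative):

```sql
-- Several users, but only ONE listing
INSERT INTO users (id)
VALUES ('1009280753550922'),
       ('1009280753550923'),
       ('1009280753550924');

INSERT INTO listings (posted_by)
VALUES ('1009280753550922');
```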
Have the tables analyzed, so that the query planner has got good statistics:
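In PostgreSQL that is simply:

```sql
ANALYZE users;
ANALYZE listings;
```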
And make it explain the SELECT:
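That is, the query from the question prefixed with EXPLAIN:

```sql
EXPLAIN
SELECT *, users.internal_id
FROM listings
JOIN users ON listings.posted_by = users.id
WHERE listings.id = 2
  AND posted_by = '1009280753550922';
```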
dbfiddle here
The reason PostgreSQL decides to perform a Sequential Scan of the listings table is that there is just one row. In that case, reading the whole table takes less time than using an index. If you just add more data to the listings table, PostgreSQL changes plans, because reading the whole table is no longer the cheapest alternative:

dbfiddle here
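The second fiddle's contents are not reproduced above; the following sketch illustrates the kind of change that flips the plan:

```sql
-- Bulk-insert enough rows that a full scan is no longer cheap
INSERT INTO listings (posted_by)
SELECT '1009280753550922'
FROM generate_series(1, 10000);

ANALYZE listings;

-- The same EXPLAIN should now show an Index Scan
-- using listings_pkey instead of a Seq Scan.
EXPLAIN
SELECT *, users.internal_id
FROM listings
JOIN users ON listings.posted_by = users.id
WHERE listings.id = 2
  AND posted_by = '1009280753550922';
```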
PostgreSQL (and most other databases) has a query planner that chooses an execution plan based on a cost estimate of all available alternatives. The cost estimate is based on the amount and statistical distribution of data within the different columns of the different tables. It takes the availability of indexes into consideration, but it will use them only when they are the cheapest alternative (or you force it to via settings).
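As an illustration of the "force it via settings" point, you can discourage sequential scans for the current session. This is a diagnostic tool for comparing plans, not something to leave on in production:

```sql
-- Diagnostic only: make sequential scans look prohibitively
-- expensive, so the planner picks an index if one is usable.
SET enable_seqscan = off;

EXPLAIN
SELECT *, users.internal_id
FROM listings
JOIN users ON listings.posted_by = users.id
WHERE listings.id = 2
  AND posted_by = '1009280753550922';

RESET enable_seqscan;
```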
With a query as simple as the one in this example, just joining two tables on an equality condition, you shouldn't need to outsmart PostgreSQL's query planner. Let the planner plan. It will nearly always choose the best plan.
You may need to play some tricks when JOINs are much more complicated: many more tables, conditions that aren't simple equalities, function calls, correlations between tables that the planner doesn't take into account, and so on. In those cases the planner can make bad cost estimates, and the plan it estimates to be cheapest is not really the cheapest.