PostgreSQL – query first on indexed column

postgresql

I have two tables – listings and users.

CREATE TABLE users (
  id text PRIMARY KEY,
  internal_id serial
);

CREATE TABLE listings (
  id bigserial PRIMARY KEY,
  created_at timestamptz default current_timestamp,
  posted_by text REFERENCES users ON UPDATE CASCADE ON DELETE CASCADE
);

When I select a listing, I want to query first by listings.id, because there is a B-tree index on it and that should give the fastest lookup. Then I want to filter by the user id. In addition, I join the users table so I get both the listing record and users.internal_id for the posted_by user.

This query works:

SELECT 
    *, users.internal_id 
FROM 
    listings 
    JOIN users ON listings.posted_by = users.id 
WHERE 
    listings.id = 2 
    and posted_by = '1009280753550922';

but I do not think it is querying by listings.id, which is indexed, for best performance:

dev_dolphin_db=# explain SELECT *, users.internal_id FROM listings JOIN users ON listings.posted_by = users.id WHERE listings.id = 2 and posted_by = '1009280753550922';
Nested Loop  (cost=0.14..10.46 rows=1 width=1560)
   ->  Seq Scan on listings  (cost=0.00..2.29 rows=1 width=840)
         Filter: ((posted_by = '1009280753550922'::text) AND (id = 2))
   ->  Index Scan using users_pkey on users  (cost=0.14..8.16 rows=1 width=720)
         Index Cond: (id = '1009280753550922'::text)

How could I build a query that would explicitly query on the indexed listings.id first?

Best Answer

If you look carefully at your query plan, you will find that the Seq Scan expects to find just one row (rows=1 width=840) in the listings table.
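(As an aside: the row counts in a plain EXPLAIN are only estimates. If you want to compare them with what actually happens at run time, EXPLAIN (ANALYZE, BUFFERS) executes the query and reports both; sketched here with the query from the question.)

 -- Runs the statement for real and shows estimated vs. actual row counts,
 -- plus buffer usage (diagnostic only; ANALYZE actually executes the query).
 EXPLAIN (ANALYZE, BUFFERS)
 SELECT
     *, users.internal_id
 FROM
     listings
     JOIN users ON listings.posted_by = users.id
 WHERE
     listings.id = 2
     and posted_by = '1009280753550922';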

This is what happens if you have a situation such as this:

 CREATE TABLE users 
 (
   id text PRIMARY KEY,
   internal_id serial,
   something_else char(800)
 ) ;

 CREATE TABLE listings 
 (
   id bigserial PRIMARY KEY,
   created_at timestamptz default current_timestamp,
   posted_by text REFERENCES users ON UPDATE CASCADE ON DELETE CASCADE
 ) ;

We add a few users, but just ONE listing:

 INSERT INTO users (id) VALUES('1009280753550922') ;
 INSERT INTO users (id) SELECT generate_series(1, 1000)::text ;

 INSERT INTO listings (posted_by) VALUES('1009280753550922') ;

Analyze the tables, so that the query planner has good statistics:

 ANALYZE users;
 ANALYZE listings;
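
If you are unsure whether your tables have been analyzed at all, pg_stat_user_tables records when the last manual or automatic ANALYZE ran; a quick check might look like this:

 -- When were these tables last analyzed (manually or by autovacuum)?
 SELECT relname, last_analyze, last_autoanalyze
 FROM pg_stat_user_tables
 WHERE relname IN ('users', 'listings');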

And make it explain the SELECT:

 EXPLAIN
 SELECT 
     *, users.internal_id 
 FROM 
     listings 
     JOIN users ON listings.posted_by = users.id 
 WHERE 
     listings.id = 2 
     and posted_by = '1009280753550922';

 | QUERY PLAN                                                                      |
 | :------------------------------------------------------------------------------ |
 | Nested Loop  (cost=0.28..9.32 rows=1 width=3248)                                |
 |   ->  Seq Scan on listings  (cost=0.00..1.01 rows=1 width=33)                   |
 |         Filter: ((posted_by = '1009280753550922'::text) AND (id = 2))           |
 |   ->  Index Scan using users_pkey on users  (cost=0.28..8.29 rows=1 width=3211) |
 |         Index Cond: (id = '1009280753550922'::text)                             |
 

dbfiddle here

The reason PostgreSQL decides to perform a Sequential Scan of the listings table is that there is just one row. In that case, reading the whole table takes less time than using an index.
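
If you want to convince yourself that the sequential scan is a cost-based choice rather than the index being unusable, you can discourage sequential scans for the current session and re-run EXPLAIN. This is purely a diagnostic sketch; don't leave the setting off in production:

 -- Make sequential scans look extremely expensive to the planner (session only).
 SET enable_seqscan = off;

 EXPLAIN
 SELECT
     *, users.internal_id
 FROM
     listings
     JOIN users ON listings.posted_by = users.id
 WHERE
     listings.id = 2
     and posted_by = '1009280753550922';

 -- Restore the default behaviour afterwards.
 RESET enable_seqscan;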

If you just add more data to the listings table, PostgreSQL changes plans, because then, reading the whole table is no longer the cheapest alternative:

 INSERT INTO listings (posted_by) SELECT generate_series(1, 1000)::text ;
 INSERT INTO listings (posted_by) SELECT generate_series(1, 1000)::text ;
 INSERT INTO listings (posted_by) SELECT generate_series(1, 1000)::text ;

 EXPLAIN
 SELECT 
     *, users.internal_id 
 FROM 
     listings 
     JOIN users ON listings.posted_by = users.id 
 WHERE 
     listings.id = 2 
     and posted_by = '1009280753550922';

 | QUERY PLAN                                                                          |
 | :---------------------------------------------------------------------------------- |
 | Nested Loop  (cost=0.56..16.60 rows=1 width=3234)                                   |
 |   ->  Index Scan using listings_pkey on listings  (cost=0.28..8.30 rows=1 width=19) |
 |         Index Cond: (id = 2)                                                        |
 |         Filter: (posted_by = '1009280753550922'::text)                              |
 |   ->  Index Scan using users_pkey on users  (cost=0.28..8.29 rows=1 width=3211)     |
 |         Index Cond: (id = '1009280753550922'::text)                                 |
 

dbfiddle here

PostgreSQL (like most other databases) has a query planner that chooses an execution plan based on a cost estimate of all available alternatives. The estimate is based on the amount and statistical distribution of data in the different columns of the different tables; it takes the availability of indexes into consideration, but will use them only if they are the cheapest alternative (or you force their use via settings).
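
If you are curious about the statistics behind those cost estimates, they are visible in the system catalogs. A sketch of where to look (pg_class for table-level counts, pg_stats for per-column statistics):

 -- Table-level estimates maintained by ANALYZE / autovacuum.
 SELECT relname, reltuples, relpages
 FROM pg_class
 WHERE relname IN ('listings', 'users');

 -- Per-column statistics: number of distinct values, most common values, etc.
 SELECT tablename, attname, n_distinct, most_common_vals
 FROM pg_stats
 WHERE tablename IN ('listings', 'users');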


With a query as simple as the one in this example, just joining two tables on an equality JOIN condition, you shouldn't need to outsmart the PostgreSQL query planner. Let the planner plan; it will nearly always choose the best plan.

You may need to play some tricks when JOINs are much more complicated: many more tables, conditions that aren't simple equalities, function calls, correlations between tables that the planner doesn't take into account, and so on. In those cases the planner can make bad cost estimates, and the plan it estimates to be cheapest is not really the cheapest. A couple of such tricks are sketched below.
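
For illustration only: two of the tricks available in PostgreSQL are lowering join_collapse_limit (so the planner keeps your written JOIN order) and materializing a CTE (PostgreSQL 12+), which acts as an optimization fence. Both are blunt instruments, shown here only to illustrate the shape of such workarounds, not as a recommendation for the query in the question:

 -- Keep the JOIN order exactly as written, for this session only (blunt instrument).
 SET join_collapse_limit = 1;

 -- A MATERIALIZED CTE (PostgreSQL 12+) is evaluated on its own and cannot be
 -- merged into the outer query, so it acts as an optimization fence.
 WITH one_listing AS MATERIALIZED (
     SELECT * FROM listings WHERE id = 2
 )
 SELECT one_listing.*, users.internal_id
 FROM one_listing
 JOIN users ON one_listing.posted_by = users.id
 WHERE one_listing.posted_by = '1009280753550922';

 -- Restore the default planner behaviour.
 RESET join_collapse_limit;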