PostgreSQL Optimization – Query Planning for Many-to-Many Select and Sort

optimizationperformancepostgresqlpostgresql-performancequery

I have a postgres database with a Genre table (40 records), a Book table (~1.3M records) and a many-many table to relate the two (~2M records). I want to select books by a particular Genre and then order them by some criteria and take the top n results. Some genres have just a few hundred books while others have several hundred thousand books. The query planner seems to have two ways of doing the query.

  1. Sorts the books by the criteria with an index and then takes the top n results with the particular genre

or

  1. Gets books with the particular genre and then do an external sort on the criteria.

1 works great if the genre has many books but is very slow if the genre only has a few hundred books. 2 works great if the genre has few books but bad if the genre has many books.

Here are the plans for the four cases. The "Fantasy" genre has 234090 records and the "Legal" genre has 110.

Popular genre with index:

explain analyze SELECT * FROM "book" AS b0 LEFT OUTER JOIN "book_genres" AS b1 ON b0."id" = b1."book_id" LEFT OUTER JOIN "genre" AS g2 ON g2."id" = b1."genre_id" WHERE (g2."name" = 'Fantasy') ORDER BY external_rating LIMIT 20 OFFSET 0;
                                                                             QUERY PLAN                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.85..395.86 rows=20 width=714) (actual time=0.063..1.556 rows=20 loops=1)
   ->  Nested Loop  (cost=0.85..1013531.82 rows=51317 width=714) (actual time=0.062..1.551 rows=20 loops=1)
         Join Filter: (b1.genre_id = g2.id)
         Rows Removed by Join Filter: 125
         ->  Nested Loop  (cost=0.85..982740.17 rows=2052676 width=683) (actual time=0.022..1.455 rows=145 loops=1)
               ->  Index Scan using book_external_rating_index on book b0  (cost=0.43..268391.77 rows=1321634 width=667) (actual time=0.010..0.976 rows=100 loops=1)
               ->  Index Scan using book_genres_book_id_index on book_genres b1  (cost=0.43..0.52 rows=2 width=16) (actual time=0.003..0.003 rows=1 loops=100)
                     Index Cond: (book_id = b0.id)
         ->  Materialize  (cost=0.00..1.50 rows=1 width=31) (actual time=0.000..0.000 rows=1 loops=145)
               ->  Seq Scan on genre g2  (cost=0.00..1.50 rows=1 width=31) (actual time=0.008..0.013 rows=1 loops=1)
                     Filter: ((name)::text = 'Comedy'::text)
                     Rows Removed by Filter: 39
 Planning time: 0.858 ms
 Execution time: 1.637 ms
(14 rows)

Popular genre without index:

explain analyze SELECT * FROM "book" AS b0 LEFT OUTER JOIN "book_genres" AS b1 ON b0."id" = b1."book_id" LEFT OUTER JOIN "genre" AS g2 ON g2."id" = b1."genre_id" WHERE (g2."name" = 'Fantasy') ORDER BY external_rating+1 LIMIT 20 OFFSET 0;
                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=47531.73..47531.78 rows=20 width=714) (actual time=760.453..760.456 rows=20 loops=1)
   ->  Sort  (cost=47531.73..47660.03 rows=51317 width=714) (actual time=760.452..760.454 rows=20 loops=1)
         Sort Key: ((b0.external_rating + '1'::double precision))
         Sort Method: top-N heapsort  Memory: 33kB
         ->  Nested Loop  (cost=962.56..46166.21 rows=51317 width=714) (actual time=16.396..631.173 rows=234090 loops=1)
               ->  Nested Loop  (cost=962.13..15652.27 rows=51317 width=47) (actual time=16.378..98.655 rows=234090 loops=1)
                     ->  Seq Scan on genre g2  (cost=0.00..1.50 rows=1 width=31) (actual time=0.004..0.009 rows=1 loops=1)
                           Filter: ((name)::text = 'Comedy'::text)
                           Rows Removed by Filter: 39
                     ->  Bitmap Heap Scan on book_genres b1  (cost=962.13..15137.60 rows=51317 width=16) (actual time=16.372..76.001 rows=234090 loops=1)
                           Recheck Cond: (genre_id = g2.id)
                           Heap Blocks: exact=13206
                           ->  Bitmap Index Scan on book_genres_genre_id_idx  (cost=0.00..949.31 rows=51317 width=0) (actual time=14.220..14.220 rows=234090 loops=1)
                                 Index Cond: (genre_id = g2.id)
               ->  Index Scan using book_pkey on book b0  (cost=0.43..0.58 rows=1 width=667) (actual time=0.002..0.002 rows=1 loops=234090)
                     Index Cond: (id = b1.book_id)
 Planning time: 0.379 ms
 Execution time: 760.503 ms
(18 rows)

Unpopular genre with index:

explain analyze SELECT * FROM "book" AS b0 LEFT OUTER JOIN "book_genres" AS b1 ON b0."id" = b1."book_id" LEFT OUTER JOIN "genre" AS g2 ON g2."id" = b1."genre_id" WHERE (g2."name" = 'Legal') ORDER BY external_rating LIMIT 20 OFFSET 0;
                                                                               QUERY PLAN                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.85..395.86 rows=20 width=714) (actual time=51.500..291.649 rows=20 loops=1)
   ->  Nested Loop  (cost=0.85..1013531.82 rows=51317 width=714) (actual time=51.499..291.645 rows=20 loops=1)
         Join Filter: (b1.genre_id = g2.id)
         Rows Removed by Join Filter: 179869
         ->  Nested Loop  (cost=0.85..982740.17 rows=2052676 width=683) (actual time=0.085..255.516 rows=179889 loops=1)
               ->  Index Scan using book_external_rating_index on book b0  (cost=0.43..268391.77 rows=1321634 width=667) (actual time=0.052..41.514 rows=125220 loops=1)
               ->  Index Scan using book_genres_book_id_index on book_genres b1  (cost=0.43..0.52 rows=2 width=16) (actual time=0.001..0.001 rows=1 loops=125220)
                     Index Cond: (book_id = b0.id)
         ->  Materialize  (cost=0.00..1.50 rows=1 width=31) (actual time=0.000..0.000 rows=1 loops=179889)
               ->  Seq Scan on genre g2  (cost=0.00..1.50 rows=1 width=31) (actual time=0.011..0.015 rows=1 loops=1)
                     Filter: ((name)::text = 'Legal'::text)
                     Rows Removed by Filter: 39
 Planning time: 1.013 ms
 Execution time: 291.732 ms
(14 rows)

Unpopular genre without index:

explain analyze SELECT * FROM "book" AS b0 LEFT OUTER JOIN "book_genres" AS b1 ON b0."id" = b1."book_id" LEFT OUTER JOIN "genre" AS g2 ON g2."id" = b1."genre_id" WHERE (g2."name" = 'Legal') ORDER BY external_rating+1 LIMIT 20 OFFSET 0;
                                                                           QUERY PLAN                                                                            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=47531.73..47531.78 rows=20 width=714) (actual time=2.124..2.132 rows=20 loops=1)
   ->  Sort  (cost=47531.73..47660.03 rows=51317 width=714) (actual time=2.123..2.130 rows=20 loops=1)
         Sort Key: ((b0.external_rating + '1'::double precision))
         Sort Method: top-N heapsort  Memory: 47kB
         ->  Nested Loop  (cost=962.56..46166.21 rows=51317 width=714) (actual time=0.089..1.918 rows=111 loops=1)
               ->  Nested Loop  (cost=962.13..15652.27 rows=51317 width=47) (actual time=0.074..0.496 rows=111 loops=1)
                     ->  Seq Scan on genre g2  (cost=0.00..1.50 rows=1 width=31) (actual time=0.010..0.014 rows=1 loops=1)
                           Filter: ((name)::text = 'Legal'::text)
                           Rows Removed by Filter: 39
                     ->  Bitmap Heap Scan on book_genres b1  (cost=962.13..15137.60 rows=51317 width=16) (actual time=0.061..0.449 rows=111 loops=1)
                           Recheck Cond: (genre_id = g2.id)
                           Heap Blocks: exact=88
                           ->  Bitmap Index Scan on book_genres_genre_id_idx  (cost=0.00..949.31 rows=51317 width=0) (actual time=0.041..0.041 rows=111 loops=1)
                                 Index Cond: (genre_id = g2.id)
               ->  Index Scan using book_pkey on book b0  (cost=0.43..0.58 rows=1 width=667) (actual time=0.011..0.011 rows=1 loops=111)
                     Index Cond: (id = b1.book_id)
 Planning time: 0.914 ms
 Execution time: 2.231 ms
(18 rows)

I have PostgreSQL 9.5.13

The query planner seems to just make the decision based on whether there is an index for the sort criteria, so it ends up making the wrong choice often. Assuming we have an index, is there a way to ensure the optimizer makes the right choice?

Best Answer

I don't think there is a clean way to do this in existing versions of PostgreSQL, and nothing in the works to do this cleanly for near-future versions either.

What it comes down to is that it would have to first execute the seq scan on genre to get g2.id, and then use that value to re-plan the query with that specific g2.id as the b1.genre_id. That way it could consult the statistics on b1.genre_id to predict how many rows it will find and change the plan accordingly. Not only does it have to know how to do this multi-step process, but it also has to prove that it is correct (what if g2.name is not unique? What if g2.id or g2.name gets updated between the 2 queries?) it has to be able to decide when this multi-step process is worth doing versus the current one-shot optimization.

Your best bet is probably to query g2.id based on g2.name yourself as a pre-step, then drop the join to g2 from your main query and just hardcode the discovered value of g2.id as the thing that b1.genre_id is compared against.