Postgresql – Postgres datetime sort slow and not using index

index-tuningperformancepostgresqlpostgresql-performancesorting

Using Postgres 9.4, I'm trying to order a query by an indexed datetime column.

explain 
SELECT g.*, array_agg(gp.platform_id) as platform_list,
            array_agg(gm.metacritic_id) as metacritic_list,
            array_agg(gr.review_id) as review_list,
            array_agg(gn.genre_id) as genre_list,
            array_agg(gp.id) as release_list 
FROM gb_game g 
LEFT JOIN gb_release gp ON g.id = gp.game_id 
LEFT JOIN gb_game_metacritic gm ON g.id = gm.game_id 
LEFT JOIN gb_game_reviews gr ON g.id = gr.game_id 
LEFT JOIN gb_game_genre gn ON g.id = gn.game_id 
GROUP BY g.id 
ORDER BY g.release_date 
DESC limit 25 offset 0

Limit  (cost=66391.46..66391.52 rows=25 width=294)
  ->  Sort  (cost=66391.46..66511.58 rows=48051 width=294)
        Sort Key: g.release_date
        ->  GroupAggregate  (cost=1.56..65035.49 rows=48051 width=294)
              Group Key: g.id
              ->  Merge Left Join  (cost=1.56..53579.05 rows=691686 width=294)
                    Merge Cond: (g.id = gn.game_id)
                    ->  Merge Left Join  (cost=1.14..16374.78 rows=63464 width=290)
                          Merge Cond: (g.id = gp.game_id)
                          ->  Merge Left Join  (cost=0.85..10856.60 rows=48051 width=282)
                                Merge Cond: (g.id = gr.game_id)
                                ->  Merge Left Join  (cost=0.58..10676.85 rows=48051 width=278)
                                      Merge Cond: (g.id = gm.game_id)
                                      ->  Index Scan using platforms_game_pkey on gb_game g  (cost=0.29..8984.87 rows=48051 width=274)
                                      ->  Index Scan using gb_game_metacritic_6072f8b3 on gb_game_metacritic gm  (cost=0.29..1264.75 rows=24569 width=8)
                                ->  Index Scan using gb_game_reviews_6072f8b3 on gb_game_reviews gr  (cost=0.28..51.04 rows=686 width=8)
                          ->  Index Scan using platforms_release_6072f8b3 on gb_release gp  (cost=0.29..4604.75 rows=63464 width=12)
                    ->  Materialize  (cost=0.42..27979.58 rows=523702 width=8)
                          ->  Index Scan using gb_game_genre_6072f8b3 on gb_game_genre gn  (cost=0.42..26670.33 rows=523702 width=8)

The release_date column doesn't have many repeats other than null.
Everything is fast until the final sort.

Is there a special kind of index I need to use to speed up this query?

Best Answer

Better query

Seems like a perfect occasion for a couple of LATERAL joins, which should be much faster here:

SELECT g.*, p.platform_list, m.metacritic_list
     , r.review_list, n.genre_list, p.release_list 
FROM  (
   SELECT *
   FROM   gb_game
   ORDER  BY release_date DESC NULLS LAST  -- !? see below
   LIMIT  25
   OFFSET 0
   ) g
,      LATERAL (
   SELECT array_agg(platform_id) AS platform_list
        , array_agg(id)          AS release_list 
   FROM   gb_release
   WHERE  game_id = g.id
   ) p
,      LATERAL (
   SELECT ARRAY (
      SELECT metacritic_id
      FROM   gb_game_metacritic
      WHERE  game_id = g.id)     AS metacritic_list
   ) m
,      LATERAL (
   SELECT ARRAY (
      SELECT review_id
      FROM   gb_game_reviews
      WHERE  game_id = g.id)     AS review_list
   ) r
,      LATERAL (
   SELECT ARRAY (
      SELECT genre_id
      FROM   gb_game_genre
      WHERE  game_id = g.id)     AS genre_list
   ) n
ORDER   BY g.release_date DESC NULLS LAST;  -- repeat inner sort order

Indexes

You need an index on each game_id column in the child tables.
Multicolumn indexes might be even better if (and only if) you can get index-only scans out of them:

gb_release (game_id, id, platform_id)  -- special: 2 additional columns
gb_game_metacritic (game_id, metacritic_id)
gb_game_reviews (game_id, review_id)
gb_game_genre (game_id, genre_id)

The order of columns matters.

And finally, you need an index on gb_game (release_date).
There is no data type called datetime in Postgres. Assuming you mean timestamp (some ORMs map this way).

Either way, if you really mean to ORDER BY g.release_date DESC, a plain index on (release_date DESC) or just (release_date) does the job

If the column can be NULL, I suspect you actually want to sort NULL values last. Adapt your query like I did to ORDER BY g.release_date DESC NULLS LAST. A plain index on (release_date) still works, but is not the optimum. Use a matching sort order in your index:

gb_game (release_date DESC NULLS LAST)

Or (almost equivalent):

gb_game (release_date ASC NULLS FIRST)

More:

Explain query

You join to multiple tables that can have n related rows for a single row in gb_game, resulting rows in the intermediary derived table are multiplied with every join to multiple rows. That's not only a major performance issue, it's also incorrect. Or at least I assume as much while detailed information is missing. Related:

I avoid the problem a priori by aggregating each child table separately in LATERAL subqueries. That seems particularly efficient, since you only pick 25 rows total.

While being at it I use a faster ARRAY constructor for single column aggregation (while I still use array_agg() for the case where two columns from the same table are aggregated separately.

Note that the order of elements in each array is arbitrary. You can add an ORDER BY clause in each lateral subquery to determine the order.

The last three lateral subqueries could be replaced with a correlated subqueries: