Postgresql – Faster joining on ‘most recent’ related rows

optimizationperformancepostgresqlquery-performance

I have a bug tracking application. The bug table stores very little,
because every state change to a bug happens by adding a comment. For
example, the current state of a bug is determined by the state value
of the most recent comment that changed state (comment.state IS NOT
NULL
). This requires some nasty queries that find the comment with the
highest id that changed the property in question.

This is what the schema looks like:

CREATE TABLE bug (
    id serial NOT NULL PRIMARY KEY,
    title character varying
);


CREATE TABLE comment (
    id serial NOT NULL PRIMARY KEY,
    state character varying,
    created_at timestamp without time zone,
    priority character varying,
    creator_id integer,
    bug_id integer references bug(id)
);

Here is the query for the index page, that shows open bugs ordered by most recently modified. Can it be made any faster?

SELECT bug.id AS id,
       bug.title AS title,
       comment_1.state AS state,
       comment_2.created_at AS created_at,
       comment_2.creator_id AS creator_id,
       comment_3.priority AS priority,
       comment_4.created_at AS last_modified,
       comment_4.creator_id AS last_modified_by 
FROM bug
JOIN
  (SELECT bug_id,
          MAX(id) AS id
   FROM comment
   WHERE state IS NOT NULL
   GROUP BY bug_id) AS max_state ON max_state.bug_id = bug.id
JOIN comment comment_1 ON comment_1.id = max_state.id    -- comment that changed the state
JOIN
  (SELECT bug_id,
          MIN(id) AS id
   FROM comment
   GROUP BY bug_id) AS min_created ON min_created.bug_id = bug.id
JOIN comment comment_2 ON comment_2.id = min_created.id  -- first comment
JOIN
  (SELECT bug_id,
          MAX(id) AS id
   FROM comment
   WHERE priority IS NOT NULL
   GROUP BY bug_id) AS max_priority ON max_priority.bug_id = bug.id
JOIN comment comment_3 ON comment_3.id = max_priority.id  -- comment that changed priority
JOIN
  (SELECT bug_id,
          MAX(id) AS id
   FROM comment
   GROUP BY bug_id) AS max_modified ON max_modified.bug_id = bug.id
JOIN comment comment_4 ON comment_4.id = max_modified.id  -- most recent comment
WHERE comment_1.state != 'closed'
ORDER BY last_modified;

The query plan:

Sort  (cost=195712503.78..196700370.73 rows=395146780 width=90) (actual time=356.245..356.278 rows=292 loops=1)
  Sort Key: comment_4.created_at
  Sort Method: quicksort  Memory: 72kB
  ->  Hash Join  (cost=15013.56..17735920.72 rows=395146780 width=90) (actual time=333.743..356.022 rows=292 loops=1)
        Hash Cond: ((max(public.comment.id)) = comment_4.id)
        ->  Hash Join  (cost=12793.54..13526.05 rows=395146780 width=86) (actual time=301.028..316.924 rows=292 loops=1)
              Hash Cond: (public.comment.bug_id = bug.id)
              ->  Hash Join  (cost=3552.21..4237.15 rows=8287 width=12) (actual time=54.202..68.236 rows=292 loops=1)
                    Hash Cond: ((max(public.comment.id)) = comment_1.id)
                    ->  HashAggregate  (cost=1384.50..1487.00 rows=10250 width=8) (actual time=28.765..32.867 rows=12418 loops=1)
                          ->  Seq Scan on comment  (cost=0.00..1118.34 rows=53232 width=8) (actual time=0.004..9.692 rows=53320 loops=1)
                                Filter: (state IS NOT NULL)
                    ->  Hash  (cost=1276.68..1276.68 rows=51203 width=12) (actual time=25.288..25.288 rows=41028 loops=1)
                          Buckets: 4096  Batches: 4  Memory Usage: 485kB
                          ->  Seq Scan on comment comment_1  (cost=0.00..1276.68 rows=51203 width=12) (actual time=0.008..13.326 rows=41028 loops=1)
                                Filter: ((state)::text <> 'closed'::text)
              ->  Hash  (cost=9210.42..9210.42 rows=2473 width=90) (actual time=246.798..246.798 rows=12418 loops=1)
                    Buckets: 1024  Batches: 2 (originally 1)  Memory Usage: 1025kB
                    ->  Nested Loop  (cost=5626.70..9210.42 rows=2473 width=90) (actual time=170.565..237.003 rows=12418 loops=1)
                          ->  Merge Join  (cost=5626.70..7964.40 rows=2473 width=86) (actual time=170.537..207.449 rows=12418 loops=1)
                                Merge Cond: (comment_3.id = (max(public.comment.id)))
                                ->  Index Scan using comment_pkey on comment comment_3  (cost=0.00..2142.27 rows=63334 width=10) (actual time=0.014..14.553 rows=63327 loops=1)
                                ->  Sort  (cost=5626.70..5632.88 rows=2473 width=84) (actual time=170.514..175.504 rows=12418 loops=1)
                                      Sort Key: (max(public.comment.id))
                                      Sort Method: external sort  Disk: 1216kB
                                      ->  Hash Join  (cost=5172.96..5487.32 rows=2473 width=84) (actual time=129.948..146.909 rows=12418 loops=1)
                                            Hash Cond: (public.comment.bug_id = bug.id)
                                            ->  HashAggregate  (cost=1435.01..1556.96 rows=12195 width=8) (actual time=30.173..34.362 rows=12420 loops=1)
                                                  ->  Seq Scan on comment  (cost=0.00..1118.34 rows=63334 width=8) (actual time=0.002..6.794 rows=63334 loops=1)
                                            ->  Hash  (cost=3706.46..3706.46 rows=2519 width=76) (actual time=99.742..99.742 rows=12418 loops=1)
                                                  Buckets: 1024  Batches: 2 (originally 1)  Memory Usage: 1025kB
                                                  ->  Hash Join  (cost=3391.64..3706.46 rows=2519 width=76) (actual time=74.910..92.291 rows=12418 loops=1)
                                                        Hash Cond: (public.comment.bug_id = bug.id)
                                                        ->  HashAggregate  (cost=1435.01..1556.96 rows=12195 width=8) (actual time=30.358..34.751 rows=12420 loops=1)
                                                              ->  Seq Scan on comment  (cost=0.00..1118.34 rows=63334 width=8) (actual time=0.003..7.033 rows=63334 loops=1)
                                                        ->  Hash  (cost=1924.57..1924.57 rows=2565 width=68) (actual time=44.522..44.522 rows=12418 loops=1)
                                                              Buckets: 1024  Batches: 2 (originally 1)  Memory Usage: 1025kB
                                                              ->  Merge Join  (cost=1381.46..1924.57 rows=2565 width=68) (actual time=23.613..36.985 rows=12418 loops=1)
                                                                    Merge Cond: (bug.id = public.comment.bug_id)
                                                                    ->  Index Scan using bug_pkey on bug  (cost=0.00..473.58 rows=12420 width=60) (actual time=0.014..4.511 rows=12420 loops=1)
                                                                    ->  Sort  (cost=1381.46..1387.88 rows=2565 width=8) (actual time=23.592..25.776 rows=12418 loops=1)
                                                                          Sort Key: public.comment.bug_id
                                                                          Sort Method: quicksort  Memory: 967kB
                                                                          ->  HashAggregate  (cost=1184.93..1210.58 rows=2565 width=8) (actual time=13.333..17.261 rows=12418 loops=1)
                                                                                ->  Seq Scan on comment  (cost=0.00..1118.34 rows=13317 width=8) (actual time=0.003..6.372 rows=13325 loops=1)
                                                                                      Filter: (priority IS NOT NULL)
                          ->  Index Scan using comment_pkey on comment comment_2  (cost=0.00..0.49 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=12418)
                                Index Cond: (id = (min(public.comment.id)))
        ->  Hash  (cost=1118.34..1118.34 rows=63334 width=12) (actual time=32.547..32.547 rows=63334 loops=1)
              Buckets: 4096  Batches: 4  Memory Usage: 686kB
              ->  Seq Scan on comment comment_4  (cost=0.00..1118.34 rows=63334 width=12) (actual time=0.012..14.321 rows=63334 loops=1)
Total runtime: 357.332 ms

Best Answer

I would suggest re-thinking your design a little bit.

You may want to have an active bool attribute to the comments, and a set of unique indexes to ensure that only the most recent comment of a certain type can be active. Then you can join using the flag instead of max and subqueries. Not only that but you get the benefit of a unique index across only the records you are interested in. I would also suggest a bug class id so that you can categorize the bugs that way for join conditions.

In that case your index definition would be something like:

CREATE UNIQUE INDEX bug_comment_class_id_idx_u ON comment(bug_class_id, bug_id) where active;

You can then:

CREATE INDEX comment_bug_id_idx ON comment(bug_id) where active;

And replace your multiple joins with a single join and a case statement.