PostgreSQL – Performance of DISTINCT vs. GROUP BY

I'm trying to select the most recent authors of messages in a room, using this (simplified) table:

             Table "public.message"
 Column  |  Type   | Nullable |               Default               
---------+---------+----------+-------------------------------------
 id      | bigint  | not null | nextval('message_id_seq'::regclass)
 room    | integer | not null | 
 author  | integer | not null | 
 created | integer | not null | 
Indexes:
    "message_pkey" PRIMARY KEY, btree (id)
    "message_author_created_room" btree (author, created, room)
    "message_room_author_created" btree (room, author, created)
    "message_room_created" btree (room, created)
    "message_room_id" btree (room, id)

The problem is a query like this is slow:

select message.author as id, max(message.created) as mc from message
where room=12 group by message.author order by mc desc limit 50;

And here's the explain(analyze, verbose, buffers):

miaou=> explain (analyze, verbose, buffers) select message.author as id, max(message.created) as mc from message
where room=12 group by message.author order by mc desc limit 50;
                                                                                             QUERY PLAN                                                                                             
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10627.14..10627.26 rows=50 width=8) (actual time=54.887..54.901 rows=50 loops=1)
   Output: author, (max(created))
   Buffers: shared hit=490
   ->  Sort  (cost=10627.14..10629.19 rows=820 width=8) (actual time=54.885..54.891 rows=50 loops=1)
         Output: author, (max(created))
         Sort Key: (max(message.created)) DESC
         Sort Method: top-N heapsort  Memory: 29kB
         Buffers: shared hit=490
         ->  Finalize GroupAggregate  (cost=1000.46..10599.90 rows=820 width=8) (actual time=14.019..54.788 rows=160 loops=1)
               Output: author, max(created)
               Group Key: message.author
               Buffers: shared hit=490
               ->  Gather Merge  (cost=1000.46..10583.50 rows=1640 width=8) (actual time=14.007..54.636 rows=248 loops=1)
                     Output: author, (PARTIAL max(created))
                     Workers Planned: 2
                     Workers Launched: 2
                     Buffers: shared hit=490
                     ->  Partial GroupAggregate  (cost=0.43..9394.18 rows=820 width=8) (actual time=3.439..34.733 rows=83 loops=3)
                           Output: author, PARTIAL max(created)
                           Group Key: message.author
                           Buffers: shared hit=2989
                           Worker 0: actual time=0.297..49.593 rows=116 loops=1
                             Buffers: shared hit=1550
                           Worker 1: actual time=6.624..40.612 rows=60 loops=1
                             Buffers: shared hit=949
                           ->  Parallel Index Only Scan using message_room_author_created on public.message  (cost=0.43..8904.09 rows=96377 width=8) (actual time=0.030..20.067 rows=73907 loops=3)
                                 Output: author, created
                                 Index Cond: (message.room = 12)
                                 Heap Fetches: 139
                                 Buffers: shared hit=2989
                                 Worker 0: actual time=0.035..28.355 rows=109834 loops=1
                                   Buffers: shared hit=1550
                                 Worker 1: actual time=0.030..23.723 rows=79112 loops=1
                                   Buffers: shared hit=949
 Planning time: 0.211 ms
 Execution time: 57.071 ms

I'd like to know how to make that faster, the real important goal being to get the N most recent authors. Is there a faster way to query that information?

Best Answer

select count(author) from message where room=12;

Because author can not be null just do SELECT count(*). Doing so may result in this being an index-only scan on either message_room_created or message_room_id. If you need the unique authors, which you don't have you'll have to use count( DISTINCT author ).

Also this being planned as a parallel query. That's relatively new functionality. You may want to try SET max_parallel_workers_per_gather = 0 to disable parallel query functionality and reposting the results.

Either way

this one is slow (same duration, about 45ms)

If you've got millions rows an index scan and heap fetches, it's not really fair to call 45ms "slow."

Best Answer

Related Solutions

PostgreSQL Indexing – How to Index WHERE (start_date >= ‘2013-12-15’)

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

Related Question