PostgreSQL – Slow SUM/GROUP BY over a large number of rows

performance postgresql postgresql-performance

Can you please help with a slow query we're running over a large amount of data (~500K rows)? We're on Postgres 10.

SELECT mike_romeo, SUM(hotel) as counts
FROM whiskey_lima
WHERE
    yankee = 'kilo'
AND
    whiskey_india = 'a_value'
AND
    echo = 'romeo_november'
GROUP BY
    mike_romeo;

And my table looks like this:

    Column     |            Type             | Collation | Nullable |                        Default
---------------+-----------------------------+-----------+----------+--------------------------------------------------------
 id            | bigint                      |           | not null | nextval('id_seq'::regclass)
 yankee        | mike_hotel.quebec           |           | not null |
 whiskey_india | character varying           |           | not null |
 echo          | mike_hotel.romeo_yankee     |           | not null |
 mike_romeo    | character varying           |           | not null |
 oscar         | timestamp without time zone |           | not null |
 hotel         | integer                     |           | not null |
 papa          | timestamp without time zone |           |          |
Indexes:
    "whiskey_lima_pkey" PRIMARY KEY, btree (id)
    "index_unique" UNIQUE, btree (echo, mike_romeo, yankee, whiskey_india, oscar)
    "index_query" btree (yankee, whiskey_india, echo)
    "index_on_oscar" btree (oscar)

Our EXPLAIN (ANALYSE, BUFFERS) looks like this: https://explain.depesz.com/s/4lU

GroupAggregate  (cost=77294.010..78398.740 rows=38882 width=69) (actual time=56343.468..59194.302 rows=527807 loops=1)
    Group Key: mike_romeo
    Buffers: shared hit=61428 read=68664, temp read=23413 written=23450
  ->  Sort  (cost=77294.010..77532.650 rows=95455 width=65) (actual time=56343.461..58848.379 rows=530929 loops=1)
          Sort Key: mike_romeo
          Sort Method: external merge  Disk: 111488kB
          Buffers: shared hit=61428 read=68664, temp read=23413 written=23450
        ->  Index Scan using mike_quebec on whiskey_lima  (cost=0.560..67104.720 rows=95455 width=65) (actual time=1.647..38814.509 rows=530929 loops=1)
                Index Cond: ((yankee = 'kilo'::mike_hotel.quebec) AND ((whiskey_india)::text = 'juliet'::text) AND (echo = 'romeo_november'::mike_hotel.romeo_yankee))
                Buffers: shared hit=61425 read=68664

Questions

What should we focus on – are there any obvious solutions?

Can we avoid the sort?

Why has the query planner underestimated the row count – could this be causing some of the slowness? We're trying to enable extended statistics.
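For reference, this is roughly what we have in mind – a sketch only, and the statistics name is made up:

-- Multivariate statistics on the three filter columns, so the planner can
-- account for correlation between them (available since Postgres 10).
CREATE STATISTICS stats_whiskey_lima_filter (dependencies)
    ON yankee, whiskey_india, echo
    FROM whiskey_lima;

-- Re-analyze so the new statistics object is populated.
ANALYZE whiskey_lima;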

We're looking into increasing work_mem so the sort no longer spills to disk – would that make the whole query performant?
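Again for reference, this is the kind of test we're planning – the 256MB value is just a guess based on the 111488kB reported by the external merge sort, and it's set for the session only:

-- The sort spilled ~109MB to disk; an in-memory sort typically needs somewhat
-- more than the on-disk size, so start above that and compare the plans.
SET work_mem = '256MB';

EXPLAIN (ANALYSE, BUFFERS)
SELECT mike_romeo, SUM(hotel) AS counts
FROM whiskey_lima
WHERE yankee = 'kilo'
  AND whiskey_india = 'a_value'
  AND echo = 'romeo_november'
GROUP BY mike_romeo;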

What else am I missing?

Thanks a lot for your help.

Best Answer

I would try with an index like:

 echo, yankee, whiskey_india, mike_romeo

Perhaps you can replace your index_unique with:

 "index_unique" UNIQUE, btree (echo, yankee, whiskey_india, mike_romeo, oscar)  

The idea is to put the equality predicates first in the index, followed by the GROUP BY column. With all three predicates on the leading columns, the index scan returns rows already ordered by mike_romeo, so the planner can aggregate without the separate sort that is currently spilling to disk.
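A sketch of how that could be done – the index names here are just examples, and CONCURRENTLY is used so the table stays writable while the index builds (it cannot run inside a transaction block):

-- Option 1: add a dedicated index for this query alongside the existing ones.
CREATE INDEX CONCURRENTLY index_query_grouped
    ON whiskey_lima (echo, yankee, whiskey_india, mike_romeo);

-- Option 2: replace index_unique with the reordered version suggested above.
-- Build the replacement first, then drop the old index and rename.
-- (Assumes index_unique is a plain unique index, not one backing a constraint.)
CREATE UNIQUE INDEX CONCURRENTLY index_unique_new
    ON whiskey_lima (echo, yankee, whiskey_india, mike_romeo, oscar);
DROP INDEX CONCURRENTLY index_unique;
ALTER INDEX index_unique_new RENAME TO index_unique;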