PostgreSQL – How to Optimize Fetching Top N Rows with Group By

postgresql, postgresql-performance

I have a big table with these fields:

CREATE UNLOGGED TABLE mytable (
  ts NUMERIC(16,6) NOT NULL,
  value INTEGER NOT NULL,
  fieldA INTEGER NOT NULL,
  fieldB INTEGER NOT NULL,
  ...
);

I am trying to find the values of fieldA which have the highest sum of value over a given time interval (1 hour, 6 hours or 1 day). In other words, for a given interval I would like to get the top 10 sums of value per fieldA, together with the matching value of that field.

A query says more than 1000 words:

  SELECT
    fieldA, sum(value) "aggr"
  FROM
    mytable
  WHERE
    ts >= 1234567890 AND
    ts < 1234567890 + 24 * 3600
  GROUP BY
    fieldA
  ORDER BY
    aggr DESC
  LIMIT 10;

Result:

 fieldA        |  aggr   
---------------+------------
 140           | 3147666070
 127           | 2647653771
 182           | 1247401380
 3             | 1247372688
 68            | 1246742329
 227           | 1246433376
 54            | 1246100364
 243           | 1245971364
 191           | 1245912876
 62            | 1245818815
(10 rows)

This query returns correct results, but I am having trouble optimizing it. Note that I run a similar query for fieldA, fieldB, … (around 5 fields) once per hour, once every 6 hours and once per day, with the time interval adjusted accordingly. When running the 24-hour variant, I see spikes in server load which cause other queries to become unacceptably slow (admittedly, I am running multiple such queries in parallel; I will fix that).

There are currently around 30 million new records per day and I would like to make it capable of handling more, running on a single server (this is the reason for UNLOGGED; in this case I don't mind losing data on a crash). Not sure if it matters: I am currently simply dropping and re-creating the table every 5–10 days, but in the future I will switch to partitions so that I can remove data older than ~3 days.
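
Roughly what I have in mind for the partitioning, as a sketch: daily range partitions on ts, so that removing old data becomes a cheap drop of the oldest partition (partition names and bounds below are just placeholders):

  CREATE TABLE mytable (
    ts NUMERIC(16,6) NOT NULL,
    value INTEGER NOT NULL,
    fieldA INTEGER NOT NULL,
    fieldB INTEGER NOT NULL
  ) PARTITION BY RANGE (ts);

  -- the individual daily partitions can stay UNLOGGED; bounds are placeholder epochs
  CREATE UNLOGGED TABLE mytable_d1 PARTITION OF mytable
    FOR VALUES FROM (1000000000) TO (1000086400);
  CREATE UNLOGGED TABLE mytable_d2 PARTITION OF mytable
    FOR VALUES FROM (1000086400) TO (1000172800);

  -- removing data older than ~3 days is then just dropping the oldest partition
  DROP TABLE mytable_d1;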

This is what the query plan looks like for 24h when there are around 31 hours' worth of data (~40M rows) in the table:

# explain analyze SELECT fieldA, sum(value) "aggr" FROM mytable WHERE ts >= 1000000000 AND ts < 1000086400 GROUP BY fieldA ORDER BY aggr DESC LIMIT 10;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=739891.81..739891.84 rows=10 width=15) (actual time=16343.876..16348.483 rows=10 loops=1)
   ->  Sort  (cost=739891.81..739892.45 rows=253 width=15) (actual time=15876.302..15876.304 rows=10 loops=1)
         Sort Key: (sum(value)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Finalize GroupAggregate  (cost=739822.25..739886.35 rows=253 width=15) (actual time=15875.512..15876.200 rows=253 loops=1)
               Group Key: fieldA
               ->  Gather Merge  (cost=739822.25..739881.29 rows=506 width=15) (actual time=15875.494..15880.493 rows=759 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Sort  (cost=738822.23..738822.86 rows=253 width=15) (actual time=15836.782..15836.810 rows=253 loops=3)
                           Sort Key: fieldA
                           Sort Method: quicksort  Memory: 36kB
                           Worker 0:  Sort Method: quicksort  Memory: 36kB
                           Worker 1:  Sort Method: quicksort  Memory: 36kB
                           ->  Partial HashAggregate  (cost=738809.60..738812.13 rows=253 width=15) (actual time=15836.524..15836.599 rows=253 loops=3)
                                 Group Key: fieldA
                                 ->  Parallel Seq Scan on mytable  (cost=0.00..669906.49 rows=13780622 width=11) (actual time=271.628..12076.394 rows=10439990 loops=3)
                                       Filter: ((ts >= '1000000000'::numeric) AND (ts < '1000086400'::numeric))
                                       Rows Removed by Filter: 3045010
 Planning Time: 0.227 ms
 JIT:
   Functions: 31
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 18.778 ms, Inlining 251.228 ms, Optimization 665.295 ms, Emission 365.024 ms, Total 1300.324 ms
 Execution Time: 16357.350 ms
(25 rows)

Is there some way I can optimize these kinds of queries? Even creating an index on ts didn't seem to help with this query (the index wasn't used; I assume because most of the rows matched the filter anyway).
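
For reference, the index was just a plain B-tree on ts, something along these lines:

  -- not chosen by the planner here, presumably because the 24h filter matches most of the table
  CREATE INDEX mytable_ts_idx ON mytable (ts);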

Alternatively, can I organize the data differently? I thought about aggregating it manually as it comes in and just updating the aggregated values, but it seems to me this would mean even more work for the database. I would appreciate some ideas.
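
For reference, by "aggregating it manually" I mean roughly an upsert into a small per-hour summary table for each incoming row or batch (the table and example values below are only a sketch):

  CREATE UNLOGGED TABLE mytable_hourly_fieldA (
    hour_ts BIGINT  NOT NULL,
    fieldA  INTEGER NOT NULL,
    value   BIGINT  NOT NULL,
    PRIMARY KEY (hour_ts, fieldA)
  );

  -- for each incoming row, add its value to the matching hour bucket
  -- (1234567890, 140 and 42 stand in for the row's ts, fieldA and value)
  INSERT INTO mytable_hourly_fieldA (hour_ts, fieldA, value)
  VALUES (floor(1234567890 / 3600) * 3600, 140, 42)
  ON CONFLICT (hour_ts, fieldA)
  DO UPDATE SET value = mytable_hourly_fieldA.value + EXCLUDED.value;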

Best Answer

When running the 24-hour variant, I see spikes in server load which cause other queries to become unacceptably slow (admittedly, I am running multiple such queries in parallel; I will fix that).

Yeah, fix that and see if you still have a problem. Also disable parallelization within individual queries by setting max_parallel_workers_per_gather = 0 for these queries. Finally, consider looking at the other queries that are being slowed down, to see if they can be improved; that might be easier than improving these queries.
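
For example, scoped to just the session (or, with SET LOCAL, the transaction) that runs these reports; a sketch using the 24h query from the question:

  SET max_parallel_workers_per_gather = 0;

  SELECT fieldA, sum(value) "aggr"
  FROM mytable
  WHERE ts >= 1000000000 AND ts < 1000086400
  GROUP BY fieldA
  ORDER BY aggr DESC
  LIMIT 10;

  RESET max_parallel_workers_per_gather;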

Alternatively, can I organize the data differently? I thought about aggregating it manually as it comes in and just updating the aggregated values, but it seems to me this would mean even more work for the database.

It seems to me that the problem is not with the total amount of work, but rather that the distribution of the work over time is awkward. So I think this idea has a lot of merit.

Also, you could store the aggregates hourly (without the LIMIT), then just re-aggregate those aggregates to higher time levels. This depends on fieldA, fieldB, etc. each having many fewer distinct values than the total dataset has rows.
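
A sketch of what that could look like for fieldA, with an hourly job filling a small summary table and the 6-hour/24-hour reports reading only that table (names and timestamps are illustrative):

  CREATE UNLOGGED TABLE fieldA_hourly (
    hour_start BIGINT  NOT NULL,
    fieldA     INTEGER NOT NULL,
    value_sum  BIGINT  NOT NULL,
    PRIMARY KEY (hour_start, fieldA)
  );

  -- once per hour, aggregate the hour that just finished (note: no LIMIT here)
  INSERT INTO fieldA_hourly (hour_start, fieldA, value_sum)
  SELECT 1000000000, fieldA, sum(value)
  FROM mytable
  WHERE ts >= 1000000000 AND ts < 1000000000 + 3600
  GROUP BY fieldA;

  -- the 24-hour report then re-aggregates only a few hundred summary rows per hour
  SELECT fieldA, sum(value_sum) "aggr"
  FROM fieldA_hourly
  WHERE hour_start >= 1000000000 AND hour_start < 1000000000 + 24 * 3600
  GROUP BY fieldA
  ORDER BY aggr DESC
  LIMIT 10;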