I have a big table with these fields:
CREATE UNLOGGED TABLE mytable (
ts NUMERIC(16,6) NOT NULL,
value INTEGER NOT NULL,
fieldA INTEGER NOT NULL,
fieldB INTEGER NOT NULL,
...
);
I am trying to find the values of fieldA which have the highest sum of value over a given time interval (1 hour, 6 hours or 1 day). In other words, for a given interval I would like to get the top 10 sums of value per fieldA, plus the matching value of this field.
A query says more than 1000 words:
SELECT
fieldA, sum(value) "aggr"
FROM
mytable
WHERE
ts >= 1234567890 AND
ts < 1234567890 + 24 * 3600
GROUP BY
fieldA
ORDER BY
aggr DESC
LIMIT 10;
Result:
fieldA | aggr
---------------+------------
140 | 3147666070
127 | 2647653771
182 | 1247401380
3 | 1247372688
68 | 1246742329
227 | 1246433376
54 | 1246100364
243 | 1245971364
191 | 1245912876
62 | 1245818815
(10 rows)
This query returns correct results, but I am having trouble optimizing it. Note that I am running a similar query for fieldA, fieldB, … (around 5 fields) once per hour, once every 6 hours and once per day, with the time interval changed appropriately. When running the 24-hour variant, I see spikes in server load which cause other queries to become unacceptably slow (it is true, however, that I'm running multiple such queries in parallel – I will fix that).
There are currently around 30 million new records per day and I would like to make the setup capable of more, running on a single server (this is the reason for UNLOGGED – in this case I don't mind losing data on a crash). Not sure if it matters: while I am currently simply dropping and re-creating the table every 5 – 10 days, in the future I will change it to use partitions so that I can remove data older than ~3 days.
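For reference, since the plan is to move to partitions anyway, a minimal sketch of declarative range partitioning on ts (PostgreSQL 10+); the partition names and bounds here are illustrative, one partition per day:

```sql
-- Partitioned parent; children inherit the schema.
CREATE UNLOGGED TABLE mytable (
    ts     NUMERIC(16,6) NOT NULL,
    value  INTEGER NOT NULL,
    fieldA INTEGER NOT NULL,
    fieldB INTEGER NOT NULL
) PARTITION BY RANGE (ts);

-- One partition per day (bounds are epoch seconds, illustrative values).
CREATE UNLOGGED TABLE mytable_d1 PARTITION OF mytable
    FOR VALUES FROM (1000000000) TO (1000086400);
CREATE UNLOGGED TABLE mytable_d2 PARTITION OF mytable
    FOR VALUES FROM (1000086400) TO (1000172800);

-- Removing data older than ~3 days becomes a cheap metadata operation
-- instead of a bulk DELETE:
DROP TABLE mytable_d1;
```

As a bonus, queries with a ts range in the WHERE clause only scan the partitions that overlap the interval.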
This is how the query plan looks for the 24-hour query when there are around 31 hours' worth of data (~40M rows) in the table:
# explain analyze SELECT fieldA, sum(value) "aggr" FROM mytable WHERE ts >= 1000000000 AND ts < 1000086400 GROUP BY fieldA ORDER BY aggr DESC LIMIT 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=739891.81..739891.84 rows=10 width=15) (actual time=16343.876..16348.483 rows=10 loops=1)
-> Sort (cost=739891.81..739892.45 rows=253 width=15) (actual time=15876.302..15876.304 rows=10 loops=1)
Sort Key: (sum(value)) DESC
Sort Method: top-N heapsort Memory: 25kB
-> Finalize GroupAggregate (cost=739822.25..739886.35 rows=253 width=15) (actual time=15875.512..15876.200 rows=253 loops=1)
Group Key: fieldA
-> Gather Merge (cost=739822.25..739881.29 rows=506 width=15) (actual time=15875.494..15880.493 rows=759 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=738822.23..738822.86 rows=253 width=15) (actual time=15836.782..15836.810 rows=253 loops=3)
Sort Key: fieldA
Sort Method: quicksort Memory: 36kB
Worker 0: Sort Method: quicksort Memory: 36kB
Worker 1: Sort Method: quicksort Memory: 36kB
-> Partial HashAggregate (cost=738809.60..738812.13 rows=253 width=15) (actual time=15836.524..15836.599 rows=253 loops=3)
Group Key: fieldA
-> Parallel Seq Scan on mytable (cost=0.00..669906.49 rows=13780622 width=11) (actual time=271.628..12076.394 rows=10439990 loops=3)
Filter: ((ts >= '1000000000'::numeric) AND (ts < '1000086400'::numeric))
Rows Removed by Filter: 3045010
Planning Time: 0.227 ms
JIT:
Functions: 31
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 18.778 ms, Inlining 251.228 ms, Optimization 665.295 ms, Emission 365.024 ms, Total 1300.324 ms
Execution Time: 16357.350 ms
(25 rows)
Is there some way I can optimize these kinds of queries? Even creating an index on ts didn't seem to help with this query (the index wasn't used – I assume because most of the rows were needed anyway).
Alternatively, can I organize the data differently? I thought about aggregating it manually as it comes in and just updating the aggregated values, but it seems to me this would mean even more work for the database. I would appreciate some ideas.
Best Answer
Yeah, fix that (the multiple queries running in parallel) and see if you still have a problem. Also, disable parallelization within individual queries as well, by setting max_parallel_workers_per_gather = 0 for these queries. Finally, consider looking at the other queries that are being made slow, to see if they can be improved; that might be easier than improving these queries.
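Disabling parallelism for just these queries can be scoped with SET LOCAL, so other sessions keep their defaults (a sketch using the 24-hour query from the question):

```sql
BEGIN;
-- Only affects the current transaction; reverts at COMMIT/ROLLBACK.
SET LOCAL max_parallel_workers_per_gather = 0;

SELECT fieldA, sum(value) "aggr"
FROM mytable
WHERE ts >= 1234567890 AND ts < 1234567890 + 24 * 3600
GROUP BY fieldA
ORDER BY aggr DESC
LIMIT 10;
COMMIT;
```

The single-worker plan will likely take longer wall-clock time, but it consumes fewer cores at once, which spreads the load and should reduce the spikes that slow down concurrent queries.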
It seems to me that the problem is not the total amount of work, but rather that the distribution of the work over time is awkward. So I think your idea of aggregating the data as it comes in has a lot of merit.
Also, you could store the aggregates hourly (without the LIMIT), then just reaggregate those aggregates to higher time levels. This depends on fieldA, fieldB, etc. each having many fewer distinct values than the total dataset has rows.
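A sketch of that approach, assuming a rollup table named mytable_hourly (the table name and the hour bucketing are illustrative; one such table per aggregated field):

```sql
-- One row per (hour bucket, fieldA); sums of INTEGER need BIGINT.
CREATE TABLE mytable_hourly (
    hour_ts BIGINT  NOT NULL,
    fieldA  INTEGER NOT NULL,
    value   BIGINT  NOT NULL,
    PRIMARY KEY (hour_ts, fieldA)
);

-- Run once per hour for the hour that just closed (no LIMIT here):
INSERT INTO mytable_hourly
SELECT (floor(ts / 3600) * 3600)::bigint AS hour_ts, fieldA, sum(value)
FROM mytable
WHERE ts >= 1234567890 AND ts < 1234567890 + 3600
GROUP BY 1, 2;

-- The 24-hour top 10 then reaggregates at most 24 small rows per fieldA
-- instead of scanning tens of millions of raw rows:
SELECT fieldA, sum(value) "aggr"
FROM mytable_hourly
WHERE hour_ts >= 1234567890 - 24 * 3600 AND hour_ts < 1234567890
GROUP BY fieldA
ORDER BY aggr DESC
LIMIT 10;
```

With ~253 distinct fieldA values (per the plan's row estimates), a day of hourly aggregates is only a few thousand rows, so the 6-hour and 24-hour reports become trivial.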