I'm trying to optimize this query, which takes between 3 and 4 minutes (it's running on an HDD right now; I guess it will drop to around 10s on an SSD when going to production, but that's still very high).
The table candles contains 50 million rows.
CREATE TABLE candles (
    timestamp BIGINT NOT NULL,
    instrument_id INTEGER NOT NULL,
    notional_usd BIGINT NOT NULL,
    CONSTRAINT candles_pk PRIMARY KEY (timestamp, instrument_id)
);
Note: timestamp cannot be changed from BIGINT to TIMESTAMP for compatibility reasons.
Query:
EXPLAIN ANALYSE
SELECT
    extract(hour from to_timestamp(timestamp/1000)) AS "time",
    extract(hour from to_timestamp(timestamp/1000))::text AS metric,
    SUM(notional_usd) / t.max_notional_usd * 100 AS notional_usd_percent
FROM candles,
    (
        SELECT
            MAX(sumed.notional_usd) AS max_notional_usd
        FROM (
            SELECT
                extract(hour from to_timestamp(timestamp/1000)) AS "time",
                SUM(notional_usd) AS notional_usd
            FROM candles
            GROUP BY "time"
        ) AS sumed
    ) AS t
GROUP BY "time", t.max_notional_usd
Result:
"GroupAggregate (cost=15600527.41..16047921.77 rows=220656 width=104) (actual time=185056.051..227426.091 rows=24 loops=1)"
" Group Key: (date_part('hour'::text, to_timestamp(((candles."timestamp" / 1000))::double precision))), (max((sum(candles_1.notional_usd))))"
" -> Sort (cost=15600527.41..15710031.53 rows=43801648 width=56) (actual time=183186.442..205920.751 rows=43801648 loops=1)"
" Sort Key: (date_part('hour'::text, to_timestamp(((candles."timestamp" / 1000))::double precision))), (max((sum(candles_1.notional_usd))))"
" Sort Method: external merge Disk: 2143000kB"
" -> Nested Loop (cost=3815240.76..5549731.22 rows=43801648 width=56) (actual time=31596.456..86215.017 rows=43801648 loops=1)"
" -> Aggregate (cost=3815240.76..3815240.77 rows=1 width=32) (actual time=31596.429..31596.430 rows=1 loops=1)"
" -> Finalize GroupAggregate (cost=3611424.70..3812482.56 rows=220656 width=40) (actual time=20496.568..31596.383 rows=24 loops=1)"
" Group Key: (date_part('hour'::text, to_timestamp(((candles_1."timestamp" / 1000))::double precision)))"
" -> Gather Merge (cost=3611424.70..3804207.96 rows=441312 width=40) (actual time=20022.571..31596.259 rows=72 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Partial GroupAggregate (cost=3610424.67..3752269.58 rows=220656 width=40) (actual time=19835.282..30720.177 rows=24 loops=3)"
" Group Key: (date_part('hour'::text, to_timestamp(((candles_1."timestamp" / 1000))::double precision)))"
" -> Sort (cost=3610424.67..3656051.39 rows=18250687 width=16) (actual time=19370.253..25047.717 rows=14600549 loops=3)"
" Sort Key: (date_part('hour'::text, to_timestamp(((candles_1."timestamp" / 1000))::double precision)))"
" Sort Method: external merge Disk: 371464kB"
" Worker 0: Sort Method: external merge Disk: 377288kB"
" Worker 1: Sort Method: external merge Disk: 365720kB"
" -> Parallel Seq Scan on candles candles_1 (cost=0.00..785454.74 rows=18250687 width=16) (actual time=0.029..8526.498 rows=14600549 loops=3)"
" -> Seq Scan on candles (cost=0.00..858457.48 rows=43801648 width=16) (actual time=0.017..16963.830 rows=43801648 loops=1)"
"Planning Time: 0.154 ms"
"Execution Time: 228151.276 ms"
How could I optimize this query?
This query 'only' takes 37s without the embedded selects. What I basically want is to divide SUM(notional_usd)
for each hour by the max value. Is there a way I could do this without the embedded select?
Best Answer
You can avoid the second scan of the large table using window functions and a subquery:
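Something along these lines (the per_hour alias is just illustrative): aggregate per hour once in a subquery, then take the maximum over those 24 rows with a window function instead of re-scanning the table:

SELECT
    "time",
    "time"::text AS metric,
    -- window MAX runs over the 24 aggregated rows, not over the base table
    notional_usd / MAX(notional_usd) OVER () * 100 AS notional_usd_percent
FROM (
    SELECT
        extract(hour from to_timestamp(timestamp/1000)) AS "time",
        SUM(notional_usd) AS notional_usd
    FROM candles
    GROUP BY "time"
) AS per_hour;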
You will still have to perform the GROUP BY, so raising work_mem for this query is a good idea.
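For example, a session-level setting (the 256MB value is only illustrative; size it to the memory you actually have available):

-- Illustrative value; run in the same session as the query above.
SET work_mem = '256MB';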