PostgreSQL – Optimize Query with Multiple Subqueries

performance, postgresql, query-performance

I'm trying to optimize this query, which currently takes 3 to 4 minutes (it's running on an HDD right now; I expect it to drop to around 10 s on the SSD we'll use in production, but that's still far too high).
The candles table contains 50 million rows.

CREATE TABLE candles (
    timestamp BIGINT NOT NULL,
    instrument_id INTEGER NOT NULL,
    notional_usd BIGINT NOT NULL,
    CONSTRAINT candles_pk PRIMARY KEY (timestamp, instrument_id)
);

Note: timestamp cannot be changed from BIGINT to TIMESTAMP for compatibility reasons.
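For reference, the column holds Unix epoch milliseconds, so the grouping hour could in principle be derived with integer arithmetic alone. A rough sketch (assuming the hours should be interpreted in UTC; I haven't benchmarked it):

SELECT (timestamp / 3600000) % 24 AS "time",  -- 3600000 ms per hour; assumes UTC
       SUM(notional_usd) AS notional_usd
FROM candles
GROUP BY 1;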

Query:

EXPLAIN ANALYSE
SELECT
    extract(hour from to_timestamp(timestamp/1000)) AS "time",
    extract(hour from to_timestamp(timestamp/1000))::text AS metric,
    SUM(notional_usd) / t.max_notional_usd * 100 AS notional_usd_percent
FROM candles,
    (
        SELECT
            MAX(sumed.notional_usd) AS max_notional_usd
        FROM (
            SELECT
                extract(hour from to_timestamp(timestamp/1000)) AS "time",
                SUM(notional_usd) AS notional_usd
            FROM candles
            GROUP BY "time"
        ) AS sumed
    ) AS t
GROUP BY "time", t.max_notional_usd;

Result:

"GroupAggregate  (cost=15600527.41..16047921.77 rows=220656 width=104) (actual time=185056.051..227426.091 rows=24 loops=1)"
"  Group Key: (date_part('hour'::text, to_timestamp(((candles."timestamp" / 1000))::double precision))), (max((sum(candles_1.notional_usd))))"
"  ->  Sort  (cost=15600527.41..15710031.53 rows=43801648 width=56) (actual time=183186.442..205920.751 rows=43801648 loops=1)"
"        Sort Key: (date_part('hour'::text, to_timestamp(((candles."timestamp" / 1000))::double precision))), (max((sum(candles_1.notional_usd))))"
"        Sort Method: external merge  Disk: 2143000kB"
"        ->  Nested Loop  (cost=3815240.76..5549731.22 rows=43801648 width=56) (actual time=31596.456..86215.017 rows=43801648 loops=1)"
"              ->  Aggregate  (cost=3815240.76..3815240.77 rows=1 width=32) (actual time=31596.429..31596.430 rows=1 loops=1)"
"                    ->  Finalize GroupAggregate  (cost=3611424.70..3812482.56 rows=220656 width=40) (actual time=20496.568..31596.383 rows=24 loops=1)"
"                          Group Key: (date_part('hour'::text, to_timestamp(((candles_1."timestamp" / 1000))::double precision)))"
"                          ->  Gather Merge  (cost=3611424.70..3804207.96 rows=441312 width=40) (actual time=20022.571..31596.259 rows=72 loops=1)"
"                                Workers Planned: 2"
"                                Workers Launched: 2"
"                                ->  Partial GroupAggregate  (cost=3610424.67..3752269.58 rows=220656 width=40) (actual time=19835.282..30720.177 rows=24 loops=3)"
"                                      Group Key: (date_part('hour'::text, to_timestamp(((candles_1."timestamp" / 1000))::double precision)))"
"                                      ->  Sort  (cost=3610424.67..3656051.39 rows=18250687 width=16) (actual time=19370.253..25047.717 rows=14600549 loops=3)"
"                                            Sort Key: (date_part('hour'::text, to_timestamp(((candles_1."timestamp" / 1000))::double precision)))"
"                                            Sort Method: external merge  Disk: 371464kB"
"                                            Worker 0:  Sort Method: external merge  Disk: 377288kB"
"                                            Worker 1:  Sort Method: external merge  Disk: 365720kB"
"                                            ->  Parallel Seq Scan on candles candles_1  (cost=0.00..785454.74 rows=18250687 width=16) (actual time=0.029..8526.498 rows=14600549 loops=3)"
"              ->  Seq Scan on candles  (cost=0.00..858457.48 rows=43801648 width=16) (actual time=0.017..16963.830 rows=43801648 loops=1)"
"Planning Time: 0.154 ms"
"Execution Time: 228151.276 ms"

How could I optimize this query?
It 'only' takes 37 s without the embedded selects. What I basically want is to divide SUM(notional_usd) for each hour by the maximum hourly value. Is there a way to do this without the embedded select?

Best Answer

You can avoid the second scan of the large table using window functions and a subquery:

SELECT "time",
       metric,
       notional_usd / max(notional_usd) OVER () * 100 AS notional_usd_percent
FROM (SELECT extract (hour FROM to_timestamp(timestamp/1000)) AS "time",
             extract (hour FROM to_timestamp(timestamp/1000))::text AS metric,
             SUM(notional_usd) AS notional_usd
      FROM candles
      GROUP BY "time") AS q;

You will still have to perform the GROUP BY, so raising work_mem for this query is a good idea.
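As a sketch (the 256MB value is only an assumption; size it to the memory you can actually spare), work_mem can be raised for this statement alone with SET LOCAL inside a transaction:

BEGIN;
SET LOCAL work_mem = '256MB';  -- assumed value; aim to keep the sort/aggregation in memory
-- run the query above
COMMIT;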