I have a time series table prices
in a PostgreSQL 10 DB.
Here is a simplified test case to illustrate the problem:
CREATE TABLE prices (
currency text NOT NULL,
side boolean NOT NULL,
price numeric NOT NULL,
ts timestamptz NOT NULL
);
I want to quickly query the last values of each currency
/side
duo, as this would give me the current buy/sell price of each currency.
My current solution is:
create index on prices (currency, side, ts desc);
select distinct on (currency, side) *
order by currency, side, ts desc;
But this will give me very slow queries (~500ms) in this table with only ~30k rows.
The actual table has four columns that I want to group, instead of two. Here is what the actual table and query really looks like:
create table prices (
exchange integer not null,
pair text not null,
side boolean not null,
guaranteed_volume numeric not null,
ts timestamp with time zone not null,
price numeric not null,
constraint prices_pkey primary key (exchange, pair, side, guaranteed_volume, ts),
constraint prices_exchange_fkey foreign key (exchange)
references exchanges (id) match simple
on update no action
on delete no action
);
create index prices_exchange_pair_side_guaranteed_volume_ts_idx
on prices (exchange, pair, side, guaranteed_volume, ts desc);
create view last_prices as
select distinct on (exchange, pair, side, guaranteed_volume)
exchange
, pair
, side
, guaranteed_volume
, price
, ts
from prices
order by exchange
, pair
, side
, guaranteed_volume
, ts desc;
There are 34441 rows, currently. Some useful debug queries:
# explain (analyze,buffers) select * from last_prices;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique (cost=2662.03..2997.71 rows=1224 width=37) (actual time=403.218..459.041 rows=392 loops=1)
Buffers: shared hit=418
-> Sort (cost=2662.03..2729.17 rows=26854 width=37) (actual time=403.213..411.041 rows=28353 loops=1)
Sort Key: prices.exchange, prices.pair, prices.side, prices.guaranteed_volume, prices.ts DESC
Sort Method: quicksort Memory: 2984kB
Buffers: shared hit=418
-> Seq Scan on prices (cost=0.00..686.54 rows=26854 width=37) (actual time=0.022..31.407 rows=28353 loops=1)
Buffers: shared hit=418
Planning time: 0.911 ms
Execution time: 460.190 ms
Explain analyze with seqscan disabled:
# explain (analyze,buffers) select * from last_prices;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.41..4458.07 rows=1224 width=37) (actual time=0.037..122.237 rows=392 loops=1)
Buffers: shared hit=15182
-> Index Scan using prices_exchange_pair_side_guaranteed_volume_ts_idx on prices (cost=0.41..4189.53 rows=26854 width=37) (actual time=0.034..91.237 rows=29649 loops=1)
Buffers: shared hit=15182
Planning time: 0.291 ms
Execution time: 122.417 ms
Adding a query with the view's query being accessed directly:
# explain (analyze, buffers)
select distinct on (exchange, pair, side, guaranteed_volume)
exchange
, pair
, side
, guaranteed_volume
, price
, ts
from prices
order by exchange
, pair
, side
, guaranteed_volume
, ts desc;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique (cost=2163.56..2429.99 rows=1224 width=37) (actual time=364.716..391.405 rows=380 loops=1)
Buffers: shared hit=418
-> Sort (cost=2163.56..2216.85 rows=21314 width=37) (actual time=364.711..370.458 rows=24011 loops=1)
Sort Key: exchange, pair, side, guaranteed_volume, ts DESC
Sort Method: quicksort Memory: 2644kB
Buffers: shared hit=418
-> Seq Scan on prices (cost=0.00..631.14 rows=21314 width=37) (actual time=0.025..13.751 rows=24011 loops=1)
Buffers: shared hit=418
Planning time: 0.258 ms
Execution time: 392.110 ms
Best Answer
DISTINCT ON
excels for few rows per combination of interest. But your use case obviously has many rows per distinct(currency, side)
. SoDISTINCT ON
is a bad choice as far as performance is concerned. You'll find a detailed assessment and an arsenal of solutions in these two related answers on SO:If all you need is the latest timestamp
ts
, the column is sort criteria and desired return value in one and the case is very simple. Look to Evan's simple solution withmax(ts)
.(Well, ideally, you'd have an index on
(currency, side, ts desc NULLS LAST)
, sincemax(ts)
ignores NULL values and better matches this sort order. But that won't matter much with a column definedNOT NULL
.)Typically, you need additional columns from each selected row (like the current price!) and/or you need to sort by multiple columns, so you need to do more.
Ideally, you have another table listing all currencies - and a FK constraint to enforce referential integrity and disallow nonexistent currencies. Then use the query technique from chapter "2a. LATERAL join" in the linked answer, expanded to account for the added
side
:Based on your initial simple test case:
You should see very fast index scans on an index on
(currency, side, ts DESC)
.If index-only scans are possible and you only need
ts
andprice
it might pay to addprice
as last column to the index.dbfiddle here
Whether you save this query in a
VIEW
or not doesn't affect performance.