A few thoughts. Consider range partitioning on "timestamp"; it will reduce the amount of work your queries have to do. A further optimization might be to calculate and store the aggregates for "closed" partitions. You will, of course, have to recalculate these when you modify historical data.
As a bonus, it will be much easier to roll off historical data that is no longer needed.
As mentioned, these are just some thoughts and reflections; they may or may not fit your situation. One pitfall may be the number of different intervals that you use for reporting: if the larger intervals are not multiples of the partitioning interval, this won't work well.
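As a sketch of the idea (partition boundaries and extra columns are placeholders, and this syntax requires PostgreSQL 10 or later), declarative range partitioning could look like this:

```sql
-- Parent table, partitioned by the event time
CREATE TABLE market_trades (
    id          bigint    NOT NULL,
    "timestamp" timestamp NOT NULL,
    price       numeric
) PARTITION BY RANGE ("timestamp");

-- One partition per month; aggregates for "closed" months can be precomputed
CREATE TABLE market_trades_2013_08 PARTITION OF market_trades
    FOR VALUES FROM ('2013-08-01') TO ('2013-09-01');

-- Rolling off old data then becomes a cheap metadata operation
DROP TABLE market_trades_2013_08;
```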
Edit: How is the plan affected by creating a temp table and an index on "timestamp"?
create index X on market_trades ( timestamp );

create temp table T ( time_range tsrange );

insert into T ( time_range )
WITH vals AS (
   SELECT '2013-08-19 0:00'::timestamp AS frame_start,
          '2013-08-26 0:00'::timestamp AS frame_end,
          '17h'::interval AS interval_length
), intervals AS (
   SELECT tsrange(start_time,
                  lead(start_time, 1, frame_end) OVER (ORDER BY start_time NULLS FIRST)) AS time_range
   FROM  (
      SELECT generate_series(frame_start, frame_end, interval_length) AS start_time,
             frame_end
      FROM   vals
      ) _
   WHERE start_time < frame_end
)
SELECT time_range
FROM   intervals i;

create index x on T ( time_range );
analyze T;

SELECT time_range, count(td.id) AS agg
FROM   T
LEFT   JOIN market_trades td ON td.timestamp <@ T.time_range
GROUP  BY T.time_range
ORDER  BY T.time_range;
You can use the `+` operator.
SELECT pk,ev_date,ev FROM events;
pk | ev_date | ev
----+------------+----------
1 | 2016-02-19 | 01:00:00
2 | 2016-02-19 | 02:00:00
3 | 2016-02-19 | 05:00:00
4 | 2016-02-19 | 12:00:00
5 | 2016-02-19 | 18:00:00
6 | 2016-02-19 | 23:00:00
7 | 2016-02-20 | 01:00:00
8 | 2016-02-20 | 05:00:00
9 | 2016-02-20 | 12:00:00
10 | 2016-02-20 | 18:00:00
(10 rows)
SELECT pk, ev_date, ev
FROM events
WHERE (ev_date + ev)
BETWEEN ('2016-02-19 04:00:00')
AND ('2016-02-20 02:00:00');
pk | ev_date | ev
----+------------+----------
3 | 2016-02-19 | 05:00:00
4 | 2016-02-19 | 12:00:00
5 | 2016-02-19 | 18:00:00
6 | 2016-02-19 | 23:00:00
7 | 2016-02-20 | 01:00:00
(5 rows)
Don't forget to create the index below:
CREATE INDEX events_ts_idx ON events ((ev_date + ev));
ANALYZE events;
I've inserted many dummy rows, so here is the output of EXPLAIN ANALYZE:
EXPLAIN ANALYZE SELECT pk, ev_date, ev FROM events WHERE (ev_date + ev)
BETWEEN ('2016-02-19 23:50:00')
AND ('2016-02-20 00:01:00');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using events_ts_idx on events (cost=0.29..8.52 rows=8 width=16) (actual time=0.014..0.029 rows=42 loops=1)
Index Cond: (((ev_date + ev) >= '2016-02-19 23:50:00'::timestamp without time zone) AND ((ev_date + ev) <= '2016-02-20 00:01:00'::timestamp without time zone))
Planning time: 0.082 ms
Execution time: 0.053 ms
(4 rows)
For comparison, I've created another index and tried another form:
CREATE INDEX events_ts2_idx ON events (ev_date,ev);
ANALYZE events;
EXPLAIN ANALYZE SELECT pk, ev_date, ev FROM events WHERE (ev_date,ev)
BETWEEN ('2016-02-19','23:50:00')
AND ('2016-02-20','0:01:00');
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on events  (cost=189.50..511.36 rows=7143 width=16) (actual time=0.027..0.042 rows=42 loops=1)
   Recheck Cond: ((ROW(ev_date, ev) >= ROW('2016-02-19'::date, '23:50:00'::time without time zone)) AND (ROW(ev_date, ev) <= ROW('2016-02-20'::date, '00:01:00'::time without time zone)))
   Heap Blocks: exact=7
   ->  Bitmap Index Scan on events_ts2_idx  (cost=0.00..187.72 rows=7143 width=0) (actual time=0.019..0.019 rows=42 loops=1)
         Index Cond: ((ROW(ev_date, ev) >= ROW('2016-02-19'::date, '23:50:00'::time without time zone)) AND (ROW(ev_date, ev) <= ROW('2016-02-20'::date, '00:01:00'::time without time zone)))
Planning time: 0.079 ms
Execution time: 0.071 ms
(7 rows)
According to my investigation, my way (using the `+` operator) is better, but I recommend comparing both approaches on your machine.
Best Answer
A couple of suggestions:

- Use legal, lower-case, unquoted identifiers to save yourself a lot of confusion. `end` (and, to a lesser degree, `start`) are reserved words.
- Your `id` column seems to be a numeric type like `integer`, not `text`.
- Since you don't need a time component in `start_day` and `end_day`, the appropriate data type is `date`, not `timestamp`.
- Why concatenate `id` and the year for a new "ID"? Add a second column `year` if you need it for a PK. Or don't add an additional column at all; it can cheaply be extracted from the new `start_day` on the fly. Generally, don't store data redundantly if you can avoid it.
- Typically, `timestamp` ranges use an inclusive lower bound and an exclusive upper bound. Since timestamps can have fractional digits (up to 6 in Postgres), that is much cleaner. Your input `2003-12-31 23:59:59` would fail for `2003-12-31 23:59:59.123`.

So, your table could look like this:
Proper test values:
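A minimal sketch of such a table with sample rows, following the suggestions above (the table name and the exact sample dates are assumptions; only the column names `id`, `start_day`, `end_day` come from the discussion):

```sql
CREATE TABLE tbl (
    id        integer NOT NULL,
    start_day date    NOT NULL,
    end_day   date    NOT NULL
);

INSERT INTO tbl (id, start_day, end_day) VALUES
    (1, '2000-05-03', '2003-12-31'),  -- range spanning several years
    (2, '2007-01-10', '2007-06-15');  -- range within a single year
```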
Solution

Use `generate_series()` in a `LATERAL` join, based on start and end day, truncated to the year with `date_trunc()`. This produces one row per year with the new start date. Add a year and you have the new end date. Except for the first and last row per `id`, where you substitute the proper start / end with `GREATEST` and `LEAST`, respectively. Voilà.

Note that `date_trunc()` returns `timestamptz` for `timestamptz` input and `timestamp` for `timestamp` input. For `date` input it defaults to `timestamptz`. Since you seem to be ignoring time zones, cast the `date` to `timestamp` explicitly (`start_day::timestamp`).

Result:
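The query described above can be sketched like this, assuming a table `tbl (id, start_day, end_day)` as in the earlier sketch (names are placeholders, not the original code):

```sql
SELECT t.id,
       EXTRACT(year FROM y.start_ts)::int AS year,
       GREATEST(t.start_day::timestamp, y.start_ts)                AS new_start,
       LEAST(t.end_day::timestamp, y.start_ts + interval '1 year') AS new_end
FROM   tbl t
CROSS  JOIN LATERAL generate_series(
           date_trunc('year', t.start_day::timestamp),  -- first year boundary
           t.end_day::timestamp,                        -- up to the end day
           interval '1 year') AS y(start_ts)
ORDER  BY t.id, year;
```

`GREATEST` keeps the real start for the first row of each `id`, and `LEAST` caps the last row at the real end; all rows in between get clean year boundaries.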
If you create a new table from the result, I suggest `(id, year)` as primary key.

Aside: this is not an operator: `||'_'||` (nor a cute little face). It's two concatenation operators `||` and a string literal `'_'`.
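For illustration (the values 7 and 2003 are made up), the pieces evaluate left to right:

```sql
SELECT 7 || '_' || 2003;  -- two || operators with the string literal '_' between them
-- result: '7_2003'
```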