Hi I am trying to optimise the a timestamp range contains <@
query for Postgres 12
I have done a bit of reading postgres documentation and discovered only GiST and SP-GiST indexes support this operator. However, I can't add one of these ( I tink I would need to add one to the heartrate table – see Schema below, but that is not a range type…).
My question is similar to this question and this one which also indicate I will need a GiST index. However, they are the other way around, eg the column they have a single timestamp and want to return from a table of tsrange
s all records that are contained. I have a table of timestamps and want to join it to a table of tsranges
For a bit of context for my schema, I have a collection of heartrates in the real dataset sampled ~1/3seconds, and a list of songs I have listen to, and when. I would like to query things like
avg(heartrate)
for a particulartrack
andartist
avg(heartrate)
for a particularartist
- etc.
Schema
create table heartrate ( "time" timestamp primary key , value float ) ; CREATE INDEX ON heartrate ("time", value); -- CREATE INDEX ON heartrate USING GIST ("time", value); can't do as "time" is not a range column. -- one gets the following error: --- ERROR: data type timestamp without time zone has no default operator class for access method "gist" Hint: You must specify an operator class for the index or define a default operator class for the data type. create table song_play( track TEXT NOT NULL, artist TEXT NOT NULL, play tsrange not null ) ; CREATE INDEX ON song_play(track, artist); INSERT INTO heartrate("time", value) SELECT d, 60+60*random() FROM generate_series('2015-01-01 00:00:00'::timestamp, '2020-01-01 00:00:00'::timestamp, '5 min'::interval) d ; INSERT INTO song_play(track,artist, play) SELECT case when random() > 0.5 then 'a' when random() > 0.5 then 'b' else 'c' end , case when random() > 0.5 then 'a' when random() > 0.5 then 'b' else 'c' end , tsrange(d, d+ (((random()*3+1)::text|| 'min')::interval)) FROM generate_series('2015-01-01 00:00:00'::timestamp, '2020-01-01 00:00:00'::timestamp, '1 day'::interval) d ; EXPLAIN SELECT sp.track, sp.artist, avg(h.value) FROM song_play sp left join heartrate h ON h.time <@ sp.play where sp.track='a' and sp.artist='b' GROUP BY sp.track, sp.artist;
Which results in the following:
✓ ✓ ✓ ✓ 525889 rows affected 1827 rows affected | QUERY PLAN | | :--------------------------------------------------------------------------------------------------------- | | GroupAggregate (cost=0.28..14689.24 rows=1 width=72) | | Group Key: sp.track, sp.artist | | -> Nested Loop Left Join (cost=0.28..14685.28 rows=526 width=72) | | Join Filter: (h."time" <@ sp.play) | | -> Index Scan using song_play_track_artist_idx on song_play sp (cost=0.28..8.29 rows=1 width=96) | | Index Cond: ((track = 'a'::text) AND (artist = 'b'::text)) | | -> Seq Scan on heartrate h (cost=0.00..8102.55 rows=525955 width=16) |
Note: the above plan results in A Full seq scan of the heartrate table, the largest table – not ideal at all!
I then decided to create the following function to see if it would help speed up queries. It converts a range eg tsrange('2020-01-01 00:00:00', '2020-01-02 00:00:00')
to a conditional query eg field >= 2020-01-01 00:00:00 and field < '2020-01-02 00:00:00'
.
essentially the same as the <@
contains operator.
And it seems to work! Although this is only helpful for looking up a particular song_play
's heartrate… not all of a track
/ artist
's song_play
's heartrates
CREATE OR REPLACE FUNCTION range_to_conditional(range anyrange, field text) RETURNS text LANGUAGE SQL IMMUTABLE STRICT AS $$ SELECT case when isempty(range) then 'false' when upper_inf(range) and lower_inf(range) then 'true' when upper_inf(range) then case when lower_inc(range) then format(' %L <= %I ', lower(range), field) else format(' %L < %I ', lower(range), field) end when lower_inf(range) then case when upper_inc(range) then format(' %L >= %I ', upper(range), field) else format(' %L > %I ', upper(range), field) end else case when lower_inc(range) and upper_inc(range) then format(' %1$L <= %3$I AND %2$L >= %3$I ', lower(range), upper(range), field) when lower_inc(range) then format(' %1$L <= %3$I AND %2$L > %3$I ', lower(range), upper(range), field) when upper_inc(range) then format(' %1$L < %3$I AND %2$L >= %3$I ', lower(range), upper(range), field) else format(' %1$L < %3$I AND %2$L > %3$I ', lower(range), upper(range), field) end end $$ ; create function avg_heartrate(sp song_play) returns double precision as $$ DECLARE retval double precision ; BEGIN EXECUTE format('select avg(h.value) from heartrate h where %s', range_to_conditional(sp.play, 'time')) INTO STRICT retval; RETURN retval; END $$ LANGUAGE plpgsql stable; SELECT sp.track, sp.artist, sp.play, avg_heartrate(sp) from song_play sp where sp.track='a' and sp.artist='b' limit 10;
✓ ✓ track | artist | play | avg_heartrate :---- | :----- | :--------------------------------------------------- | :----------------- a | b | ["2015-01-03 00:00:00","2015-01-03 00:03:42.413608") | 78.93074469582096 a | b | ["2015-01-10 00:00:00","2015-01-10 00:01:32.299356") | 83.89127804586359 a | b | ["2015-01-11 00:00:00","2015-01-11 00:03:24.722083") | 62.333722293527885 a | b | ["2015-01-19 00:00:00","2015-01-19 00:01:14.845757") | 77.65872734128969 a | b | ["2015-01-30 00:00:00","2015-01-30 00:01:40.991165") | 102.88233680407437 a | b | ["2015-02-06 00:00:00","2015-02-06 00:03:51.264716") | 70.34797302970127 a | b | ["2015-02-13 00:00:00","2015-02-13 00:01:23.358657") | 62.91734005187344 a | b | ["2015-02-25 00:00:00","2015-02-25 00:02:04.856602") | 115.45533419257616 a | b | ["2015-02-28 00:00:00","2015-02-28 00:02:46.800728") | 117.39846990343175 a | b | ["2015-03-18 00:00:00","2015-03-18 00:02:54.893186") | 68.1618921408235
db<>fiddle here
Thanks!
Best Answer
Change the join condition from
to
(if your ranges are open at the right end, else use different inequality operators).
Then the nested loop join can use a regular b-tree index on
heartrate(time)
to speed up the inner query.