I use PostgreSQL 9.1 on Ubuntu 12.04.
I need to select records inside a range of time: my table time_limits has two timestamp fields and one integer property. There are additional columns in my actual table that are not involved in this query.
create table time_limits (
    start_date_time timestamp,
    end_date_time   timestamp,
    id_phi          integer,
    primary key (start_date_time, end_date_time, id_phi)
);
This table contains roughly 2M records.
Queries like the following took enormous amounts of time:
select * from time_limits as t
where t.id_phi=0
and t.start_date_time <= timestamp'2010-08-08 00:00:00'
and t.end_date_time >= timestamp'2010-08-08 00:05:00';
So I tried adding another index – the inverse of the PK:
create index idx_inversed on time_limits(id_phi, start_date_time, end_date_time);
I got the impression that performance improved: the time for accessing records in the middle of the table seems more reasonable, somewhere between 40 and 90 seconds. But it's still several tens of seconds for values in the middle of the time range, and twice as long when targeting the chronological end of the table.
I tried explain analyze for the first time to get this query plan:
Bitmap Heap Scan on time_limits (cost=4730.38..22465.32 rows=62682 width=36) (actual time=44.446..44.446 rows=0 loops=1)
Recheck Cond: ((id_phi = 0) AND (start_date_time <= '2011-08-08 00:00:00'::timestamp without time zone) AND (end_date_time >= '2011-08-08 00:05:00'::timestamp without time zone))
-> Bitmap Index Scan on idx_time_limits_phi_start_end (cost=0.00..4714.71 rows=62682 width=0) (actual time=44.437..44.437 rows=0 loops=1)
Index Cond: ((id_phi = 0) AND (start_date_time <= '2011-08-08 00:00:00'::timestamp without time zone) AND (end_date_time >= '2011-08-08 00:05:00'::timestamp without time zone))
Total runtime: 44.507 ms
See the results on depesz.com.
What could I do to optimize the search? You can see all the time is spent scanning the two timestamp columns once id_phi is set to 0. And I don't understand the big scan (60K rows!) on the timestamps. Aren't they indexed by the primary key and the idx_inversed I added?
Should I change from timestamp types to something else?
I have read a little about GiST and GIN indexes. I gather they can be more efficient under certain conditions for custom types. Is that a viable option for my use case?
Best Answer
For Postgres 9.1 or later:
In most cases the sort order of an index is hardly relevant. Postgres can scan backwards practically as fast. But for range queries on multiple columns it can make a huge difference. Closely related:
Consider your query:

select * from time_limits as t
where t.id_phi=0
and t.start_date_time <= timestamp'2010-08-08 00:00:00'
and t.end_date_time >= timestamp'2010-08-08 00:05:00';
Sort order of the first column id_phi in the index is irrelevant. Since it's checked for equality (=), it should come first. You got that right. More in this related answer:

Postgres can jump to id_phi = 0 in next to no time and consider the following two columns of the matching index. These are queried with range conditions of inverted sort order (<=, >=). In my index, qualifying rows come first. Should be the fastest possible way with a B-tree index¹:

start_date_time <= something: index has the earliest timestamp first.
    Recurse until the first row fails to qualify (super fast).

end_date_time >= something: index has the latest timestamp first.
    Continue with the next value for column 2 ..
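Pieced together, such an index could be sketched like this; the name matches the one in the question's query plan, and the DESC NULLS LAST detail on the third column is an assumption inferred from the sort-order reasoning above:

```sql
-- third column sorted descending, so qualifying rows come first
CREATE INDEX idx_time_limits_phi_start_end
ON time_limits (id_phi, start_date_time, end_date_time DESC NULLS LAST);
```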
Postgres can either scan forward or backward. The way you had the index, it has to read all rows matching on the first two columns, then filter on the third. Be sure to read the chapter Indexes and ORDER BY in the manual. It fits your question pretty well.

How many rows match on the first two columns?
Only few with a start_date_time close to the start of the time range of the table. But almost all rows with id_phi = 0 at the chronological end of the table! So performance deteriorates with later start times.

Planner estimates
The planner estimates rows=62682 for your example query. Of those, none qualify (rows=0). You might get better estimates if you increase the statistics target for the table. For 2,000,000 rows, a higher statistics target on the relevant columns might pay. Or set it even higher. More in this related answer:
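A sketch of raising the per-column statistics target; the value 1000 is an illustrative choice, not from the source:

```sql
ALTER TABLE time_limits ALTER COLUMN start_date_time SET STATISTICS 1000;
ALTER TABLE time_limits ALTER COLUMN end_date_time   SET STATISTICS 1000;
ANALYZE time_limits;  -- recollect statistics at the new target
```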
I guess you don't need that for id_phi (only few distinct values, evenly distributed), but for the timestamps (lots of distinct values, unevenly distributed). I also don't think it matters much with the improved index.
CLUSTER / pg_repack / pg_squeeze

If you want it faster still, you could streamline the physical order of rows in your table. If you can afford to lock your table exclusively (at off hours, for instance), rewrite your table and order rows according to the index with CLUSTER.

Or consider pg_repack or the later pg_squeeze, which can do the same without an exclusive lock on the table.
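A sketch of the CLUSTER approach, assuming the improved index named in the question's query plan (note that CLUSTER holds an exclusive lock on the table while it runs):

```sql
CLUSTER time_limits USING idx_time_limits_phi_start_end;
```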
Either way, the effect is that fewer blocks need to be read from the table and everything is pre-sorted. It's a one-time effect deteriorating over time with writes on the table fragmenting the physical sort order.
GiST index in Postgres 9.2+
¹ With pg 9.2+ there is another, possibly faster option: a GiST index for a range column.
There are built-in range types for timestamp and timestamp with time zone: tsrange, tstzrange. A btree index is typically faster for an additional integer column like id_phi. Smaller and cheaper to maintain, too. But the query will probably still be faster overall with the combined index.

Change your table definition or use an expression index.
For the multicolumn GiST index at hand you also need the additional module btree_gist installed (once per database), which provides the operator classes to include an integer.

The trifecta! A multicolumn functional GiST index:
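A sketch of the extension, the index, and a matching query; the index name is illustrative, and note that tsrange defaults to an inclusive lower and exclusive upper bound:

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- once per database

-- expression index: id_phi via btree_gist, the two timestamps as one range
CREATE INDEX idx_time_limits_gist ON time_limits
USING  gist (id_phi, tsrange(start_date_time, end_date_time));

-- "contains range": the stored range must contain the target range
SELECT *
FROM   time_limits
WHERE  id_phi = 0
AND    tsrange(start_date_time, end_date_time)
       @> tsrange('2010-08-08 00:00', '2010-08-08 00:05');
```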
Use the "contains range" operator @> in your query now.

SP-GiST index in Postgres 9.3+
An SP-GiST index might be even faster for this kind of query - except that, quoting the manual:
Still true in Postgres 12.
You would have to combine an spgist index on just (tsrange(...)) with a second btree index on (id_phi). With the added overhead, I'm not sure this can compete.

Related answer with a benchmark for just a tsrange column: