I have a scenario where a search on a range with the same value on both sides uses a plan with an index that is not the composite index I expected. After some work, I was able to generate sample data that shows the different planning. The first query uses the idx_hourts_btree index, but the second uses another composite index. In the real scenario, the plan using the "anti-natural" gist composite index produces very slow queries.
Setup:
create table sampledata as
select (row_number() over ())::int as row_number
, extract(hour from generate_series)::int as hour
, extract(minute from generate_series)::int as minute
, extract(second from generate_series)::int as second
, generate_series as ts
from generate_series (timestamptz '2004-03-08 18:29:00'
, timestamptz '2004-03-08 18:31:00'
, interval '1 millisecond');
create index idx_HourTs_btree on sampledata using btree(hour, ts);
create extension if not exists btree_gist; -- required for a GiST index on plain scalar types
create index idx_idts_gist on sampledata using gist(row_number, minute, second, ts);
analyze sampledata;
# \d sampledata
Table "public.sampledata"
Column | Type | Collation | Nullable | Default
------------+--------------------------+-----------+----------+---------
row_number | integer | | |
hour | integer | | |
minute | integer | | |
second | integer | | |
ts | timestamp with time zone | | |
Indexes:
"idx_hourts_btree" btree (hour, ts)
"idx_idts_gist" gist (row_number, minute, second, ts)
Queries:
explain analyze
select * from sampledata where hour = 18 and ts between '2004-03-08 18:30:00.991' and '2004-03-08 18:30:00.993';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_hourts_btree on sampledata (cost=0.57..8.59 rows=1 width=16) (actual time=3.077..3.079 rows=3 loops=1)
Index Cond: ((hour = 18) AND (ts >= '2004-03-08 18:30:00.991-03'::timestamp with time zone) AND (ts <= '2004-03-08 18:30:00.993-03'::timestamp with time zone))
Planning time: 0.114 ms
Execution time: 3.101 ms
(4 rows)
Time: 4.090 ms
explain analyze
select * from sampledata where hour = 18 and ts between '2004-03-08 18:30:00.993' and '2004-03-08 18:30:00.993';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_idts_gist on sampledata (cost=0.55..8.57 rows=1 width=16) (actual time=5.985..5.988 rows=1 loops=1)
Index Cond: ((ts >= '2004-03-08 18:30:00.993-03'::timestamp with time zone) AND (ts <= '2004-03-08 18:30:00.993-03'::timestamp with time zone))
Filter: (hour = 18)
Planning time: 0.153 ms
Execution time: 6.030 ms
(5 rows)
Time: 7.318 ms
PG server: 9.5.9.
Why the difference in planning?
Here is the EXPLAIN ANALYZE output for the real table, obfuscated. "quebec" is the gist composite index on (another column, mike); "uniform" is the btree index on (hotel, mike).
Here with the same value on both sides of BETWEEN (bad execution time):
https://explain.depesz.com/s/O5Gf
Here with a small difference between the BETWEEN bounds (good execution time):
https://explain.depesz.com/s/67yL
It's possible to simulate the problem with the sample data table I've provided in the question.
Update 2018-03-14
@Erwin's answer gave me more ideas to improve the sample data toward the real case. In the previous sample, I was trying to simulate the real data with redundant columns that do not exist in the real schema. I need the gist composite index for another use case, so before making a bigger design change I'd like to know whether I'm overlooking something simpler. The new sample should demonstrate the actual difference in execution and planning time. I've also tried increasing the statistics target for the columns, with no success.
Setup script:
show default_statistics_target;
show random_page_cost;
drop table if exists sampledata2;
create table sampledata2 as (
  with a as (select generate_series(1, 50) as id)
  select id
       , md5(random()::text) as rand
       , generate_series(timestamptz '2004-03-07'
                       , timestamptz '2004-03-17'
                       , interval '1 minute') as ts
  from a
);
select * from sampledata2 limit 3;
create index idx_idTs_btree on sampledata2 using btree(id, ts);
create index idx_randTs_gist on sampledata2 using gist(rand, ts); -- requires the btree_gist extension
analyze sampledata2;
explain analyze select * from sampledata2 where id=42 and ts between '2004-03-07 00:22:00-03' and '2004-03-07 00:22:00-03';
explain analyze select * from sampledata2 where id=42 and ts between '2004-03-07 00:22:00-03' and '2004-03-07 00:23:00-03';
alter table sampledata2 alter column id set statistics 10000;
alter table sampledata2 alter column rand set statistics 10000;
alter table sampledata2 alter column ts set statistics 10000;
analyze sampledata2;
explain analyze select * from sampledata2 where id=42 and ts between '2004-03-07 00:22:00-03' and '2004-03-07 00:22:00-03';
explain analyze select * from sampledata2 where id=42 and ts between '2004-03-07 00:22:00-03' and '2004-03-07 00:23:00-03';
Output:
default_statistics_target
---------------------------
50
(1 row)
Time: 0.640 ms
random_page_cost
------------------
4
(1 row)
Time: 0.257 ms
DROP TABLE
Time: 35.850 ms
SELECT 720050
Time: 1438.842 ms (00:01.439)
id | rand | ts
----+----------------------------------+------------------------
1 | 8e1d3920ef44f94e71291b2371178ece | 2004-03-07 00:00:00-03
1 | 664fcfc94e09ea0ff050b934e6cb486f | 2004-03-07 00:01:00-03
1 | ac52031c8d98df67e2aacaf7d10b3af7 | 2004-03-07 00:02:00-03
(3 rows)
Time: 0.651 ms
CREATE INDEX
Time: 356.923 ms
CREATE INDEX
Time: 35661.019 ms (00:35.661)
ANALYZE
Time: 86.580 ms
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_randts_gist on sampledata2 (cost=0.41..8.43 rows=1 width=45) (actual time=10.851..18.430 rows=1 loops=1)
Index Cond: ((ts >= '2004-03-07 00:22:00-03'::timestamp with time zone) AND (ts <= '2004-03-07 00:22:00-03'::timestamp with time zone))
Filter: (id = 42)
Rows Removed by Filter: 49
Planning time: 0.224 ms
Execution time: 18.479 ms
(6 rows)
Time: 19.422 ms
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_idts_btree on sampledata2 (cost=0.42..8.45 rows=1 width=45) (actual time=0.040..0.041 rows=2 loops=1)
Index Cond: ((id = 42) AND (ts >= '2004-03-07 00:22:00-03'::timestamp with time zone) AND (ts <= '2004-03-07 00:23:00-03'::timestamp with time zone))
Planning time: 0.144 ms
Execution time: 0.067 ms
(4 rows)
Time: 0.803 ms
ALTER TABLE
Time: 1.220 ms
ALTER TABLE
Time: 0.924 ms
ALTER TABLE
Time: 0.894 ms
ANALYZE
Time: 2675.784 ms (00:02.676)
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_randts_gist on sampledata2 (cost=0.41..8.43 rows=1 width=45) (actual time=6.472..11.493 rows=1 loops=1)
Index Cond: ((ts >= '2004-03-07 00:22:00-03'::timestamp with time zone) AND (ts <= '2004-03-07 00:22:00-03'::timestamp with time zone))
Filter: (id = 42)
Rows Removed by Filter: 49
Planning time: 0.757 ms
Execution time: 11.524 ms
(6 rows)
Time: 12.948 ms
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_idts_btree on sampledata2 (cost=0.42..8.45 rows=1 width=45) (actual time=0.021..0.022 rows=2 loops=1)
Index Cond: ((id = 42) AND (ts >= '2004-03-07 00:22:00-03'::timestamp with time zone) AND (ts <= '2004-03-07 00:23:00-03'::timestamp with time zone))
Planning time: 0.505 ms
Execution time: 0.045 ms
(4 rows)
Time: 1.170 ms
Update 2018-03-15
A bug was reported, and an improvement related to this issue will land in v11. https://www.postgresql.org/message-id/31902.1521064417%40sss.pgh.pa.us
Best Answer
I used this simpler test setup instead:
345601 rows instead of 345M are enough to test the effect in question.
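The original setup script is not preserved in this copy. A hedged reconstruction, reusing the column names from the question; the 4-day range at 1-second resolution is my assumption, chosen because it yields exactly 345601 rows:

```sql
-- Hypothetical reconstruction of the smaller test setup.
-- 4 days at 1-second resolution = 4 * 86400 + 1 = 345601 rows.
create extension if not exists btree_gist;  -- for GiST on scalar types

create table sampledata as
select (row_number() over ())::int as row_number
     , extract(hour   from g)::int as hour
     , extract(minute from g)::int as minute
     , extract(second from g)::int as second
     , g                           as ts
from generate_series(timestamptz '2004-03-08'
                   , timestamptz '2004-03-12'
                   , interval '1 second') g;

create index idx_hourts_btree on sampledata using btree (hour, ts);
create index idx_idts_gist    on sampledata using gist  (row_number, minute, second, ts);
analyze sampledata;
```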
Your primary question:
The answer: because of the lower cost estimate for the GiST index in the 2nd case. In my setup:
The plan after deleting the GiST index with:
DROP INDEX idx_idts_gist;
Note the higher cost estimate. So the GiST index wins. Postgres has no option other than to rely on its estimates to decide on the "best" query plan. Even though the btree index turns out to be faster for this particular case, Postgres can't know that before executing the query.
Estimates are based on table statistics, cost settings, and intricate details of the query planner, which calculates estimates from these numbers. If table statistics, cost settings, or the modeled characteristics of certain operations (here, in particular, index scans on different index types) are inaccurate, the estimate is off to a greater or lesser degree.
Query & indexes
All that aside, your indexes and query have several flaws.
Don't use a timestamp literal for a timestamptz value. It makes you depend on the current time zone setting, which can have unexpected effects like different results in different sessions. Provide timestamptz literals! Or, more explicitly:
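The explicit form of the query is not preserved in this copy; a hedged sketch of what it might look like, with explicit timestamptz literals (and without the hour = 18 predicate, as discussed below):

```sql
-- Sketch only; the original example is not preserved.
select *
from   sampledata
where  ts between timestamptz '2004-03-08 18:30:00.991-03'
          and     timestamptz '2004-03-08 18:30:00.993-03';
```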
Also note that hour = 18 is gone; see below. The explicit data type cast is redundant when we know that ts is of type timestamptz.
Details:
Also, the query and indexes are inefficient in several respects:
- hour = 18 in the WHERE clause only seems to make sense while working with the sub-optimal index on (hour, ts). Remove it.
- hour is not helping your queries in the index idx_HourTs_btree. It's completely redundant, just bloat. Remove it.
- row_number in idx_idts_gist is of no use for the given queries, either. Remove it.
These indexes would be cheaper and faster:
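The definitions of the leaner indexes are not preserved in this copy; a hedged sketch following the advice above (index names are my own):

```sql
-- Sketch only: single-column indexes on ts, without the redundant columns.
create index idx_ts_btree on sampledata using btree (ts);
create index idx_ts_gist  on sampledata using gist  (ts);  -- requires btree_gist
```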
Combined with the query above.
The completely redundant column
hour
has no conceivable use at all. Just drop it, thereby making the table smaller and everything a bit faster.Aside, since you seem to be operating on large cardinalities, a BRIN index might be an interesting alternative if your data is physically sorted, mostly:
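The original BRIN definition is not shown in this copy; a hedged sketch:

```sql
-- Sketch only; BRIN with the default 128 pages per range.
create index idx_ts_brin on sampledata using brin (ts);
```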
Or maybe with fewer pages per range:
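Again a sketch with an assumed parameter value, not the original definition:

```sql
-- Sketch only; smaller ranges trade index size for lookup precision.
create index idx_ts_brin_small on sampledata using brin (ts)
with (pages_per_range = 32);
```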
Read performance will be faster with a btree index if you have sufficient memory and your index is cached most of the time. The major benefit of the BRIN index is the extremely small index size. Test and consider the complete situation.
Addressing update 2018-03-14
Increasing the statistics target like you demonstrate does have an effect on cost estimates in my tests, just not on the special case of BETWEEN with a single value as lower and upper bound (y BETWEEN x AND x). The Postgres query planner is no AI, nor does it try to be: estimations have to be simple and fast.
(Aside: only the statistics for id and ts matter, not those for rand.)
The query planner obviously underestimates the cost of the GiST index for this special case, which has a different characteristic than an actual range. I can't blame the Postgres developers that nobody paid attention to this special case, since y BETWEEN x AND x does not make a whole lot of sense to begin with. Normally, one would simplify to y = x, for which Postgres switches to the btree index in my tests.
You can run a different query with y = x for the special case. You have to check user input anyway, since the upper bound cannot be below the lower bound for BETWEEN.
You could also create the much more efficient GiST index on just (ts) that I mentioned above, which delivers performance close to the btree index in my tests.
You might report a bug and see if any developer bites. But first, update your Postgres version to at least 9.5.12 to avoid noise.
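The rewritten query for the degenerate case is not preserved here; a hedged sketch against the sampledata2 table from the update:

```sql
-- Sketch only: when both BETWEEN bounds are equal, use equality instead.
select *
from   sampledata2
where  id = 42
and    ts = timestamptz '2004-03-07 00:22:00-03';
```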
And you did.
Turns out this has been improved in /devel to be released with Postgres 11. But the change won't be back-patched to older versions.