I have a query that is not using existing indices and I do not understand why.
The table:
mustang=# \d+ bss.amplifier_saturation
Table "bss.amplifier_saturation"
Column | Type | Modifiers | Storage | Description
--------+--------------------------+-------------------------------------------------------------------+---------+-------------
value | integer | not null | plain |
target | integer | not null | plain |
start | timestamp with time zone | not null | plain |
end | timestamp with time zone | not null | plain |
id | integer | not null default nextval('amplifier_saturation_id_seq'::regclass) | plain |
lddate | timestamp with time zone | not null default now() | plain |
Indexes:
"amplifier_saturation_pkey" PRIMARY KEY, btree (id)
"amplifier_saturation_target_start_end_key" UNIQUE CONSTRAINT, btree (target, start, "end")
"amplifier_saturation_end" btree ("end")
"amplifier_saturation_lddate" btree (lddate)
"amplifier_saturation_start" btree (start)
"amplifier_saturation_target" btree (target)
"amplifier_saturation_value" btree (value)
The query/plan:
mustang=# explain select max(lddate) from bss.amplifier_saturation
where start >= '1987-12-31 00:00:00'
and start <= '1988-04-09 00:00:00';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=189.41..189.42 rows=1 width=0)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..189.41 rows=1 width=8)
-> Index Scan Backward using amplifier_saturation_lddate on amplifier_saturation (cost=0.00..2475815.50 rows=13071 width=8)
Index Cond: (lddate IS NOT NULL)
Filter: ((start >= '1987-12-31 00:00:00-08'::timestamp with time zone) AND (start <= '1988-04-09 00:00:00-07'::timestamp with time zone))
Why does this not use the index amplifier_saturation_start? It seems to me like the DB should scan that index to find the start date, continue through it to delimit all entries until the end date, and finally sort that (small subset of) data for the maximum lddate (something like pp. 40-41 of SQL Performance Explained). In desperation I also tried an index on (start, start desc), but it didn't help.
Incidentally, select count(*) works just fine:
mustang=# explain select count(*) from bss.amplifier_saturation
where start >= '1987-12-31 00:00:00'
and start <= '1988-04-09 00:00:00';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=38711.84..38711.85 rows=1 width=0)
-> Index Scan using amplifier_saturation_start on amplifier_saturation (cost=0.00..38681.47 rows=12146 width=0)
Index Cond: ((start >= '1987-12-31 00:00:00-08'::timestamp with time zone) AND (start <= '1988-04-09 00:00:00-07'::timestamp with time zone))
- Running ANALYZE didn't help. pg_stats shows a reasonable spread of values for start, which seems to argue for using the index.
- Setting statistics to 10,000 on either column (start or lddate) didn't help.
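For reference, the statistics change in the last bullet was done with something like this (10,000 is the maximum target):

```sql
-- Raise the per-column statistics target, then refresh the statistics.
ALTER TABLE bss.amplifier_saturation ALTER COLUMN start  SET STATISTICS 10000;
ALTER TABLE bss.amplifier_saturation ALTER COLUMN lddate SET STATISTICS 10000;
ANALYZE bss.amplifier_saturation;
```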
Maybe I should explain why I think the plan is wrong. The table contains 30,000,000 rows. Only 3,500 are in the date range. But maybe that's still too many for them to be read separately?
Adding an index on (lddate desc, start) works (not sure the desc is required). It can then use a pure index approach (IIUC) and it runs much faster:
mustang=# create index tmp_as on bss.amplifier_saturation (lddate desc, start);
CREATE INDEX
mustang=# explain select max(lddate) from bss.amplifier_saturation
where start >= '1987-12-31 00:00:00'
and start <= '1988-04-09 00:00:00';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=69.76..69.77 rows=1 width=0)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..69.76 rows=1 width=8)
-> Index Scan using tmp_as on amplifier_saturation (cost=0.00..861900.22 rows=12356 width=8)
Index Cond: ((lddate IS NOT NULL) AND (start >= '1987-12-31 00:00:00-08'::timestamp with time zone) AND (start <= '1988-04-09 00:00:00-07'::timestamp with time zone))
So, I guess to answer my own question: it seems the cost of fetching the table data for those 3,500 rows separately is higher than scanning backward through a 30,000,000-entry index (yay spinning disks), while a pure index scan is clearly better than either. Maybe someone smarter than me can come up with a better answer?
Best Answer
Explanation
Even with 30,000,000 rows and only 3,500 in the date range, it can be faster to read tuples from the top of the index amplifier_saturation_lddate on lddate. The first row that passes the filter on start can be returned as is; no sort step is needed. With a perfectly random distribution, a little below 9,000 index tuples (30,000,000 / 3,500) would have to be checked on average.

Using amplifier_saturation_start, Postgres would still have to determine max(lddate) after fetching all 3,500 qualifying rows. It's a close call. The decision depends on gathered statistics and your cost settings: depending on data distribution and other details, one or the other will actually be faster, and one or the other will be expected to be faster.

Better index
This can be substantially faster with a multicolumn index on (lddate, start), like you found yourself already. This way Postgres can use an index-only scan and not touch the heap (the table) at all.

But there is another minor thing you can improve. Did you wonder about this detail in your EXPLAIN output?

Index Cond: (lddate IS NOT NULL)

Why does Postgres have to exclude NULL values?
Because NULL sorts after the greatest value in ascending order, or before it in descending order. The maximum non-null value, which is what the aggregate function max() returns, is not at the very beginning / end of the index if there are NULL values. Adding NULLS LAST | FIRST adjusts the sort order to the characteristic of max() (and makes the opposite, min(), more expensive). Since we are mostly interested in the latest timestamp, DESC NULLS LAST is the better choice.

Now, your table column lddate obviously doesn't have NULL values, being defined NOT NULL, so the effect on performance will be negligible in this particular case. Still worth mentioning for cases that can have NULL.

The other index option would be an index on (start, lddate), basically a pimped amplifier_saturation_start index, which would also allow index-only scans. Depending on data distribution and actual parameter values in your query, one or the other will be faster.

Two notes concerning timestamp
- Your column is timestamptz, but your query predicates use plain timestamp literals. Postgres derives the time zone from your current timezone setting and adjusts accordingly. This may or may not be as intended, and it certainly makes the query volatile: the result depends on a setting of your session. That would be problematic for a call that could be made from different time zones (with differing session settings). Then you'd rather use an explicit offset, like '1987-12-31 00:00:00 PST', PST (pacific standard time) being a random example time zone, or the AT TIME ZONE construct to make it stable.
- For the upper bound of such a range, consider < instead of <=.
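Putting these recommendations together, the improved index and a time-zone-stable form of the query could look like the sketch below. The index name is mine, and the exclusive upper bound with < is not exactly equivalent to the original <= (it excludes the single instant '1988-04-09 00:00:00'):

```sql
-- Index tailored to max(lddate): the latest non-null lddate sits at the
-- very top, and covering start allows an index-only scan.
CREATE INDEX amplifier_saturation_lddate_start
    ON bss.amplifier_saturation (lddate DESC NULLS LAST, start);

-- Explicit UTC offsets make the query independent of the session's
-- timezone setting; they match the offsets shown in the plans above.
SELECT max(lddate)
FROM   bss.amplifier_saturation
WHERE  start >= '1987-12-31 00:00:00-08'
AND    start <  '1988-04-09 00:00:00-07';
```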