PostgreSQL Performance – Slow Index Scans in Large Table

index, optimization, performance, postgresql, postgresql-performance

Update 2020-08-04:

Since this answer is apparently still being viewed regularly, I wanted to provide an update on the situation. We're currently using PostgreSQL 11 with the table partitioned on timestamp, and we easily handle a few billion rows across the partitions. The index-only scans are a life saver; it wouldn't be possible without them.
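For anyone curious what that setup roughly looks like: the sketch below uses declarative range partitioning (PostgreSQL 11+ syntax). The partition names and date bounds are illustrative, not our actual ones; the columns follow the schema shown further down.

```sql
-- Hypothetical sketch of the partitioned layout (PG 11+).
-- Partition names and bounds are examples, not the real ones.
CREATE TABLE energy_energyentry (
    id          bigserial,
    prop_id     integer NOT NULL,
    "timestamp" timestamptz NOT NULL,
    value       double precision NOT NULL
) PARTITION BY RANGE ("timestamp");

CREATE TABLE energy_energyentry_2020q3 PARTITION OF energy_energyentry
    FOR VALUES FROM ('2020-07-01') TO ('2020-10-01');

-- A covering index per partition keeps index-only scans working;
-- in PG 11+ an index created on the parent cascades to partitions.
CREATE INDEX ON energy_energyentry (prop_id, "timestamp", value);
```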


Using PostgreSQL 9.2, I'm having trouble with slow queries on a relatively large table (200+ million rows). I'm not trying anything crazy, just querying historic values. Below are the query and the query plan output.

My table layout:

                                   Table "public.energy_energyentry"
  Column   |           Type           |                            Modifiers
-----------+--------------------------+-----------------------------------------------------------------
 id        | integer                  | not null default nextval('energy_energyentry_id_seq'::regclass)
 prop_id   | integer                  | not null
 timestamp | timestamp with time zone | not null
 value     | double precision         | not null
Indexes:
    "energy_energyentry_pkey" PRIMARY KEY, btree (id)
    "energy_energyentry_prop_id" btree (prop_id)
    "energy_energyentry_prop_id_timestamp_idx" btree (prop_id, "timestamp")
Foreign-key constraints:
    "energy_energyentry_prop_id_fkey" FOREIGN KEY (prop_id) REFERENCES gateway_peripheralproperty(id) DEFERRABLE INITIALLY DEFERRED

The data ranges from 2012-01-01 till now, with new data constantly being added. There are about 2.2k distinct values in the prop_id foreign key, distributed evenly.

I notice that the row estimates aren't far off, but the cost estimate seems too large by a factor of about 4. This probably isn't an issue, but is there anything I could do about it?

I expect that disk access might be the issue, since the table isn't in memory all the time.

EXPLAIN ANALYZE 
SELECT SUM("value") 
FROM "energy_energyentry" 
WHERE 
  "prop_id"=82411 
  AND "timestamp">'2014-06-11' 
  AND "timestamp"<'2014-11-11'
;
 Aggregate  (cost=214481.45..214481.46 rows=1 width=8) (actual time=51504.814..51504.814 rows=1 loops=1)
   ->  Index Scan using energy_energyentry_prop_id_timestamp_idx on  energy_energyentry (cost=0.00..214434.08 rows=18947 width=8) (actual time=136.030..51488.321 rows=13578 loops=1)
         Index Cond: ((prop_id = 82411) AND ("timestamp" > '2014-06-11 00:00:00+00'::timestamp with time zone) AND ("timestamp" < '2014-11-11 00:00:00+00'::timestamp with time zone))
 Total runtime: 51504.841 ms

Any suggestions on how to make this faster?
I'm also fine with just hearing I didn't do anything weird.

Best Answer

Your table is big, and so is any index spanning the whole table. Assuming that:

  • only new data (with timestamp = now()) is entered
  • existing rows are neither changed nor deleted
  • you have data since 2012-01-01 but queries predominantly target the current year (?)

I would suggest a partial, multi-column (covering!) index:

CREATE INDEX ON energy_energyentry (prop_id, "timestamp", value)
WHERE "timestamp" >= '2014-01-01 0:0';  -- adapt to your needs

Only include the time range that is queried regularly. Effectiveness deteriorates over time as new entries fall outside the indexed range, so recreate the index with a newer cutoff from time to time. (You may need to adapt your queries accordingly.) See the related answer linked below.
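Recreating the index doesn't have to block writes. A sketch of the maintenance step, assuming hypothetical index names (CREATE INDEX CONCURRENTLY and DROP INDEX CONCURRENTLY are both available in 9.2):

```sql
-- Hypothetical maintenance step: build a replacement partial index with a
-- newer cutoff without blocking writes, then drop the stale one.
-- Index names are made up for illustration.
CREATE INDEX CONCURRENTLY energy_entry_covering_2015
    ON energy_energyentry (prop_id, "timestamp", value)
    WHERE "timestamp" >= '2015-01-01 0:0';

DROP INDEX CONCURRENTLY energy_entry_covering_2014;
```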

The last column, value, is only included to get index-only scans out of this. Aggressive autovacuum settings may help by keeping the visibility map up to date, as @jjanes already mentioned.
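Autovacuum can be tuned per table via storage parameters, so the rest of the cluster is unaffected. The values below are example settings, not a recommendation for this exact workload:

```sql
-- Example per-table settings (values are illustrative): vacuum this table
-- more often so the visibility map stays current and index-only scans
-- avoid heap fetches.
ALTER TABLE energy_energyentry SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% of rows change
    autovacuum_vacuum_cost_delay   = 0      -- don't throttle vacuum on this table
);
```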

The partial index should fit into RAM more easily and stay there longer.

You may need to include this WHERE condition in your queries for the planner to understand that the partial index is applicable, like:

SELECT sum(value) AS sum_value
FROM   energy_energyentry
WHERE  prop_id = 82411 
AND   "timestamp" > '2014-06-11 0:0' 
AND   "timestamp" < '2014-11-11 0:0'
AND   "timestamp" >= '2014-01-01 0:0'; -- seems redundant, but may be needed

Since your query is summing a lot of rows (rows=13578), this is going to take some time, even with an index-only scan. It shouldn't be anywhere near 50 seconds, though. Less than a second on any halfway decent hardware.
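To confirm you're actually getting index-only scans, run the query under EXPLAIN and look for an "Index Only Scan" node with a low "Heap Fetches" count (the plan shape here is what to look for, not actual output):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(value)
FROM   energy_energyentry
WHERE  prop_id = 82411
AND    "timestamp" >  '2014-06-11 0:0'
AND    "timestamp" <  '2014-11-11 0:0'
AND    "timestamp" >= '2014-01-01 0:0';
-- Look for:  Index Only Scan using ... Heap Fetches: 0
-- Many heap fetches mean the visibility map is stale: vacuum the table.
```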

Related (but ignore CLUSTER and FILLFACTOR, both are irrelevant if you can get index-only scans out of this):

Aside:
Since you already have an index on (prop_id, "timestamp"), the additional index on just (prop_id) may cost more than it's worth.
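The leading column of the multi-column index serves prop_id-only lookups as well. If the statistics show the single-column index is unused, it can be dropped without blocking writes; a sketch:

```sql
-- Check cumulative usage first (counts since the last stats reset):
SELECT indexrelname, idx_scan
FROM   pg_stat_user_indexes
WHERE  relname = 'energy_energyentry';

-- If energy_energyentry_prop_id shows few or no scans:
DROP INDEX CONCURRENTLY energy_energyentry_prop_id;
```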