Postgresql – Speeding Up Query on a Time Series Table. It’s Slow to get (MAX – MIN) Aggregation Value

aggregateperformancepostgresqlpostgresql-9.6query-performancetime-series-database

Question

How to improve this query to increase the speed of the query shown below from 30+ seconds to milliseconds? I am using PostgreSQL: v 9.6.6.

Context

I have a time series table buildings.hispoint that stores historical data for data points in a table buildings.point.

I need to aggregate the Max(value) - Min(value) as aggregation_value for different time ranges (e.g. year to date) for a large collection of data points.

This proves to be very slow and needs improvement.

Table Structure of buildings.hispoint (20,210,129 rows)

CREATE TABLE buildings.hispoint (
    id int,
    value float,
    datetime timestamp,
    point_id …
    CONSTRAINT foo FOREIGN KEY (point_id)
    REFERENCES buildings.point (point_id),
    …
);

Query

SELECT COUNT(datetime) AS id, 
MAX(value) - MIN(value) AS aggregation_value
FROM buildings_hispoint
WHERE point_id = 44 
AND buildings_hispoint.datetime BETWEEN '2018-01-01 00:00:00' AND '2018-05-02 09:18:14';

QUERY PLAN

Aggregate  (cost=160967.11..160967.12 rows=1 width=16) (actual time=21713.720..21713.720 rows=1 loops=1)
  Buffers: shared hit=7817 read=66145 dirtied=1
  ->  Bitmap Heap Scan on buildings_hispoint  (cost=5705.82..159708.13 rows=167864 width=16) (actual time=161.733..21585.478 rows=19783 loops=1)
        Recheck Cond: (point_id = 44)
        Rows Removed by Index Recheck: 3474664
        Filter: ((datetime >= '2018-01-01 00:00:00+00'::timestamp with time zone) AND (datetime <= '2018-05-02 09:18:14+00'::timestamp with time zone))
        Rows Removed by Filter: 306305
        Heap Blocks: exact=46580 lossy=26488
        Buffers: shared hit=7817 read=66145 dirtied=1
        ->  Bitmap Index Scan on buildings_measurementdatapoint_ffb10c68  (cost=0.00..5663.85 rows=306589 width=0) (actual time=139.360..139.360 rows=326088 loops=1)
              Index Cond: (point_id = 44)
              Buffers: shared read=894
Planning time: 40.504 ms
Execution time: 21717.750 ms

Best Answer

An index on (point_id, datetime, value) will likely speed up the query, as it will only have to do an index seek, read only the relevant part of the index and also have available (from the index) all the values of the value column to calculate the MIN and MAX.

An index on (A) (point_id) or (B) (point_id, datetime) would need to also find the read the relevant tuples from the table, in order (A) to filter them for the datetime range and (A and B) to read the values of the value column.