PostgreSQL – Time-based sampling

Tags: join, postgresql, postgresql-performance, query-performance, window-functions

Is there any way to do sampling on time series data in PostgreSQL (v11) based on a fixed time interval?

For example, we are storing 1 record per second for 24 hours in a given table. This means approximately 86400 records are in the table per day.

Is there a simpler way to retrieve one data point for every 10 seconds of data? Such a query should return 8640 records.

Currently, I have implemented a solution using a window function, but it takes too long on larger data sets.

Current implementation

SELECT message_date, message
FROM (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY date_trunc('minute', message_date)
        ORDER BY message_date DESC
    ) AS r, t.*
    FROM rdm.telemetry_2021_01_21 t
    WHERE device_id = 'CP2'
    AND message_date >= '1/21/2021 01:00:00 AM'
    AND message_date <= '1/21/2021 11:00:00 PM'
) x
WHERE x.r <= 1
ORDER BY message_date DESC;

The above solution supports sampling at 1-second, 1-minute, or 1-hour granularity, depending on which value is passed to date_trunc. However, this query takes too much time against a table that contains millions of records.
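Note that date_trunc() itself has no 10-second unit. To sample every 10 seconds with this same window-function approach, a computed epoch bucket can replace the date_trunc() expression. A sketch against the same table and columns as above, untested against your data:

```sql
-- Sketch: 10-second buckets via epoch arithmetic, since date_trunc()
-- supports only fixed units such as 'second', 'minute', 'hour'.
SELECT message_date, message
FROM (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY floor(extract(epoch FROM message_date) / 10)  -- 10-second bucket
        ORDER BY message_date DESC
    ) AS r, t.*
    FROM rdm.telemetry_2021_01_21 t
    WHERE device_id = 'CP2'
    AND message_date >= timestamp '2021-01-21 01:00:00'
    AND message_date <  timestamp '2021-01-21 23:00:00'
) x
WHERE x.r = 1
ORDER BY message_date DESC;
```

Changing the divisor (10) adjusts the bucket width to any number of seconds, but this variant has the same performance characteristics as the original window-function query.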

"Explain analysis" result

EXPLAIN plan

Can anyone please suggest another option or solution? Maybe using a lateral join?

Best Answer

Basic query to sample data for arbitrary fixed time intervals:

SELECT grid_time, message_date, message
FROM   generate_series(timestamp '2021-01-21 01:00:00'   -- always use unambiguous ISO format
                     , timestamp '2021-01-21 11:00:00'
                     , interval '73 minutes') grid_time  -- arbitrary time interval
LEFT   JOIN LATERAL (
   SELECT message_date, message
   FROM   rdm.telemetry_2021_01_21
   WHERE  device_id = 'CP2'
   AND    message_date >= grid_time
   AND    message_date <  grid_time + interval '73 minutes'  -- same interval
   ORDER  BY message_date
   LIMIT  1
   ) t ON true;

Support this with a multicolumn index on (device_id, message_date).
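A sketch of that index, using the question's table name:

```sql
-- Multicolumn index matching the lateral subquery's filter (device_id)
-- and its range condition plus sort order (message_date).
CREATE INDEX ON rdm.telemetry_2021_01_21 (device_id, message_date);
```

With this index in place, each lateral subquery can resolve via a single index descent per grid point instead of scanning the partition.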

For a time interval of 10 seconds like you mentioned, you don't need an index. That would fetch 10% of all rows, and Postgres will typically default to a sequential scan, as that is faster. (Exceptions with index-only scans apply.)

The added filter device_id = 'CP2' can change a lot. But that's only in your query, not in your expressed requirements.

An index typically only helps performance for filters retrieving a few percent of all rows or less; in your example, that means a sampling interval of roughly 30 seconds or more. For selective filters, an index can help a lot.
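To see which plan Postgres actually chooses for a given interval and filter, run the query under EXPLAIN. A sketch against the question's table (ANALYZE executes the query, so expect it to take as long as the real run):

```sql
-- Inspect the chosen plan and buffer usage for one grid cell's lookup.
EXPLAIN (ANALYZE, BUFFERS)
SELECT message_date, message
FROM   rdm.telemetry_2021_01_21
WHERE  device_id = 'CP2'
AND    message_date >= timestamp '2021-01-21 01:00:00'
AND    message_date <  timestamp '2021-01-21 01:00:10'
ORDER  BY message_date
LIMIT  1;
```

If the output shows a sequential scan where you expected an index scan, the planner has judged the filter insufficiently selective, which matches the rule of thumb above.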
