PostgreSQL Oldest Values – How to Get Efficiently Over Each ID

greatest-n-per-group, performance, postgresql, postgresql-11, query-performance

How can PostgreSQL return a list of the oldest timestamp values over a table of sensor id measurements?

Let me explain the situation with a sample table:

CREATE TABLE sensor_data (
   sensor_id INTEGER,
   time      TIMESTAMPTZ,
   value     NUMERIC,
   PRIMARY KEY (sensor_id, time)
);

Populated table example:

+-----------+------------------+-------+
| sensor_id |       time       | value |
+-----------+------------------+-------+
|         1 | 2018-01-01 00:00 |     1 |
|         1 | 2018-01-01 01:00 |     2 |
|         3 | 2018-01-01 03:00 |     4 |
|         3 | 2018-01-01 04:00 |     3 |
|         4 | 2018-01-01 03:00 |     5 |
|         4 | 2018-01-01 04:00 |     6 |
+-----------+------------------+-------+

When the query filters on sensor_id values such as (1, 3), I want it to return something like this:

+-----------+------------------+-------+
| sensor_id |       time       | value |
+-----------+------------------+-------+
|         1 | 2018-01-01 01:00 |     2 |
|         3 | 2018-01-01 04:00 |     3 |
+-----------+------------------+-------+

How can I write a query that uses the PRIMARY KEY index to speed this up?

Best Answer

There are many possible query styles, and most will readily use your PK index on (sensor_id, time), as it fits the task. (Postgres can scan an index backwards practically as fast as forwards.) This should be near perfect:

SELECT s.sensor_id, sd.time, sd.value
FROM   unnest ('{1,3}'::int[]) s(sensor_id)
LEFT   JOIN LATERAL (
   SELECT *
   FROM   sensor_data sd
   WHERE  sd.sensor_id = s.sensor_id
   ORDER  BY time DESC
   LIMIT  1
   ) sd ON true;


LEFT JOIN .. ON true keeps sensors without any data entries in the result, with NULL in place of time and value.
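As an illustration of one of the other query styles mentioned above, here is a minimal sketch using DISTINCT ON against the same sensor_data table. It is an alternative for comparison, not a replacement for the LATERAL query: it returns only sensors that actually have rows, and whether it is as fast depends on how many rows exist per sensor.

```sql
-- One row per sensor_id: the row with the latest time.
-- The leading ORDER BY expression must match the DISTINCT ON expression.
SELECT DISTINCT ON (sensor_id)
       sensor_id, time, value
FROM   sensor_data
WHERE  sensor_id IN (1, 3)
ORDER  BY sensor_id, time DESC;
```

With many rows per sensor, the LATERAL form tends to be the safer choice, because it can stop after the first index entry for each sensor.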

Since you are on Postgres 11, a covering index might pay:

... PRIMARY KEY (sensor_id, time) INCLUDE (value)

But it makes the index bigger and writes to the table more expensive, and your names indicate a write-heavy table. And since you only query for a few rows at a time, queries won't get much faster anyway. So it is probably best to keep it the way you have it. Related:

Does a query with a primary key and foreign keys run faster than a query with just primary keys?
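To check whether the primary key index is actually used, and whether a covering index would change anything, you can inspect the plan. The following is a sketch; note that EXPLAIN with ANALYZE really executes the query.

```sql
-- Show the actual plan, timings, and buffer usage for the LATERAL query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT s.sensor_id, sd.time, sd.value
FROM   unnest('{1,3}'::int[]) s(sensor_id)
LEFT   JOIN LATERAL (
   SELECT sd.time, sd.value
   FROM   sensor_data sd
   WHERE  sd.sensor_id = s.sensor_id
   ORDER  BY sd.time DESC
   LIMIT  1
   ) sd ON true;
```

In the output, look for a backward index scan on the primary key index under the nested loop. With a covering index this may become an index-only scan, but only if the table has been vacuumed recently enough for the visibility map to be up to date.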

Related Solutions

Mysql – How to use subquery on the same table in MySQL

Now for your question...

The most sensible approach, no JOINs, no sub-SELECTs, is the following:

DELETE FROM `TABLE` WHERE value_was IS NULL OR value_was <= value_now;

Will you look at that. It's your query from the question. Why suggest your original idea?

It is a full table scan IN ONE PASS. Any other approach can potentially double the work (or triple it if you try to get indexes involved this late in the game). Running it this way also delays the need to defragment the table.

If you want to delete and defragment, here are two options.

OPTION #1

MyISAM

DELETE FROM `TABLE` WHERE value_was IS NULL OR value_was <= value_now;
ALTER TABLE `TABLE` ENGINE=MyISAM;

InnoDB

DELETE FROM `TABLE` WHERE value_was IS NULL OR value_was <= value_now;
ALTER TABLE `TABLE` ENGINE=InnoDB;

OPTION #2

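-- The filter below copies exactly the rows the DELETE would keep,
-- so no separate DELETE is needed first.
CREATE TABLE `NEWTABLE`
SELECT * FROM `TABLE`
WHERE (value_was IS NULL OR value_was <= value_now) IS NOT TRUE;
-- CREATE TABLE ... SELECT does not copy indexes; re-create them afterwards.
DROP TABLE `TABLE`;
ALTER TABLE `NEWTABLE` RENAME `TABLE`;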

CAVEAT

Before you do anything, run this count:

SELECT COUNT(1) INTO @Count_All FROM `TABLE`;
SELECT COUNT(1) INTO @Count_Zap FROM `TABLE`
WHERE value_was IS NULL OR value_was <= value_now;
SET @DeletePct = @Count_Zap * 100 / @Count_All;
SELECT @DeletePct;

@DeletePct is the percentage of the table that will be deleted if you run the DELETE.

If the percentage is low, then DELETE FROM `TABLE` WHERE value_was IS NULL OR value_was <= value_now; is all you need. Defragmentation can wait. Otherwise, you may choose one of the options or live with the table's row fragmentation.

On a side note, if you wish to use indexes, please add them after defragmenting the table.
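To get a rough idea of how much space the DELETE has left unused before deciding whether to defragment, you can look at the table's entry in information_schema. This is a sketch that assumes the table lives in the current database and uses the placeholder name TABLE from the examples above.

```sql
-- data_free is the allocated but unused space, in bytes.
-- For InnoDB it refers to the tablespace the table belongs to,
-- so treat it as an approximation.
SELECT table_name, engine, data_length, index_length, data_free
FROM   information_schema.tables
WHERE  table_schema = DATABASE()
AND    table_name   = 'TABLE';
```

A data_free value that is large relative to data_length suggests that rebuilding the table with one of the options above may be worthwhile.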

Postgresql – Nearest value for each foreign key

You can use window functions to achieve your goal. lag() and lead() can help in a query like this:

SELECT lag(di_timestamp) OVER ordering, lead(di_timestamp) OVER ordering
  FROM data_item
 WHERE fk_fc_id IN (35246,35247)
WINDOW ordering AS (ORDER BY di_timestamp 
                    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);

This will return the previous and the next timestamps, if there are any.
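If the previous and next timestamps should be computed separately for each foreign key rather than across all selected rows, a variation with PARTITION BY might look like the sketch below. The column aliases prev_ts and next_ts are illustrative names, not part of the original query.

```sql
-- lag()/lead() restart for every fk_fc_id because of PARTITION BY.
-- They ignore the window frame, so no frame clause is needed.
SELECT fk_fc_id,
       di_timestamp,
       lag(di_timestamp)  OVER w AS prev_ts,
       lead(di_timestamp) OVER w AS next_ts
FROM   data_item
WHERE  fk_fc_id IN (35246, 35247)
WINDOW w AS (PARTITION BY fk_fc_id ORDER BY di_timestamp);
```

The first row of each partition gets NULL for prev_ts and the last row gets NULL for next_ts.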
