Postgresql – How to merge overlapping time intervals if the interval not more than 4 minutes in PostgreSql

postgresql

How to merge overlapping time intervals if the interval not more than 4 minutes (for example only where id = 1).

I have the next table:

--------------------------------------
id | action    | date
--------------------------------------
1  | started   | 2020-08-18 13:51:02
1  | suspended | 2020-08-18 13:51:04
2  | started   | 2020-08-18 13:52:14
2  | suspended | 2020-08-18 13:52:17
3  | started   | 2020-08-18 13:52:21
3  | suspended | 2020-08-18 13:52:24
1  | started   | 2020-08-18 13:57:21
1  | suspended | 2020-08-18 13:57:22
1  | started   | 2020-08-18 15:07:56
1  | suspended | 2020-08-18 15:08:56
1  | started   | 2020-08-18 15:09:11
1  | suspended | 2020-08-18 15:09:11
1  | started   | 2020-08-18 15:09:11
1  | suspended | 2020-08-18 15:09:13

Expected result:

--------------------------------------
id | action    | date
--------------------------------------
1  | started   | 2020-08-18 13:51:02
1  | suspended | 2020-08-18 13:52:24
1  | started   | 2020-08-18 13:57:21
1  | suspended | 2020-08-18 13:57:22
1  | started   | 2020-08-18 15:07:56
1  | suspended | 2020-08-18 15:09:13

I will be very grateful for your help!

Best Answer

As I mentioned in my comment, your request is not well defined because it is unclear how the 4-minute wide time windows are supposed to be chosen. In addition, the nature and importance of the id column is unclear.

Be that as it may, I came up with the following query. It uses window functions in order to define the group edges, and then filters out any row that is not an edge:

WITH t AS (
    SELECT id, action, date
         , min(date) OVER (PARTITION BY round(extract(epoch FROM date)/240)) min_date
         , max(date) OVER (PARTITION BY round(extract(epoch FROM date)/240)) max_date
    FROM tab
)
SELECT id, action, date
FROM t
WHERE date = min_date OR date = max_date;

Related Solutions

Postgresql – Optimizing a query on a large table (EXPLAIN ANALYZE result inside)

A few thoughts. Consider range partitioning on "timestamp", it will reduce the amount of work your query have to do. Further optimization might be to calculate and store the agg for "closed" partitions. You will of course have to recalculate this when you modify historical information.

As a bonus it will be much easier to roll-out historical data that is no longer needed.

As mentioned this is just some thoughts and reflections, it may or it may not fit your situation. One pitfall may be the number of different intervals that you use for reporting. If larger intervals is not multiples of the partitioning interval this wont work well.

Edit: How is the plan affected by creating an temp table and an index on "timestamp"?

create index X on market_trades ( timestamp );

create temp table T ( time_range tsrange );

insert into T ( time_range )
WITH vals AS (
    SELECT '2013-08-19 0:00'::timestamp  AS frame_start,
       '2013-08-26 0:00'::timestamp  AS frame_end,
       '17h'::interval               AS interval_length
),   intervals AS (
    SELECT tsrange(start_time,
               lead(start_time, 1, frame_end) OVER (ORDER BY start_time   NULLS FIRST)) AS time_range
    FROM (
        SELECT generate_series(frame_start, frame_end, interval_length) AS  
               start_time, frame_end
        FROM vals
    ) _
    WHERE start_time < frame_end
)
SELECT time_range
FROM intervals i;

create index x on T ( time_range );

analyze t;

SELECT time_range, count(td.id) AS agg
FROM T
LEFT JOIN market_trades td
    ON td.timestamp <@ T.time_range
GROUP  BY T.time_range
ORDER  BY T.time_range;

Postgresql – Pipelinedb: How to group stream data by each N minutes in continuous view

Maybe this can helps you.

I've divided the clock in groups of 5 minutes:

group of 5 min = extract('minute' from time)::int / 5

min 0 to 4 = 0
min 5 to 9 = 1
....

This adds an extra column to your current table.

select *, extract('minute' from time)::int / 5 as g_of_5m 
from foo 
order by ip, time;

+----+----------+-----------------+------------+---------+
|    |   time   |        ip       |    type    | g_of_5m |
+----+----------+-----------------+------------+---------+
| 1  | 22:35:00 | 111.111.111.111 |  page_open |    7    |
+----+----------+-----------------+------------+---------+
| 2  | 22:36:00 | 111.111.111.111 |  page_open |    7    |
+----+----------+-----------------+------------+---------+
| 3  | 22:37:00 | 111.111.111.111 | page_close |    7    |
+----+----------+-----------------+------------+---------+
| 4  | 22:42:00 | 111.111.111.111 | page_close |    8    |
+----+----------+-----------------+------------+---------+
| 5  | 22:44:00 | 111.111.111.111 |  page_open |    8    |
+----+----------+-----------------+------------+---------+
| 6  | 22:42:00 | 222.111.111.111 |  page_open |    8    |
+----+----------+-----------------+------------+---------+
| 7  | 22:43:00 | 222.111.111.111 |  page_open |    8    |
+----+----------+-----------------+------------+---------+
| 8  | 22:44:00 | 222.111.111.111 | page_close |    8    |
+----+----------+-----------------+------------+---------+

As Evan Carroll has pointed in his comment, it should be grouped by hour,g_of_5m, just to take care of this case: (same g_of_5m, more than hour)

22:40  111.111.111.111  page_open
23:41  111.111.111.111  page_open

And this is the query:

select min(time) as init_time, ip, type, count(*) as count
from foo
group by ip, type, extract('hour' from time), extract('minute' from time)::int / 5
order by ip, min(time);

+----+-----------+-----------------+------------+-------+
|    | init_time |        ip       |    type    | count |
+----+-----------+-----------------+------------+-------+
| 1  |  22:35:00 | 111.111.111.111 |  page_open |   2   |
+----+-----------+-----------------+------------+-------+
| 2  |  22:37:00 | 111.111.111.111 | page_close |   1   |
+----+-----------+-----------------+------------+-------+
| 3  |  22:42:00 | 111.111.111.111 | page_close |   1   |
+----+-----------+-----------------+------------+-------+
| 4  |  22:44:00 | 111.111.111.111 |  page_open |   1   |
+----+-----------+-----------------+------------+-------+
| 5  |  22:42:00 | 222.111.111.111 |  page_open |   2   |
+----+-----------+-----------------+------------+-------+
| 6  |  22:44:00 | 222.111.111.111 | page_close |   1   |
+----+-----------+-----------------+------------+-------+

Just if you want to check it: http://rextester.com/CVWBM15291

Best Answer

Related Solutions

Postgresql – Optimizing a query on a large table (EXPLAIN ANALYZE result inside)

Postgresql – Pipelinedb: How to group stream data by each N minutes in continuous view

Related Question