PostgreSQL – Compute the duration of an event based on consecutive values in another column

Tags: aggregate, postgresql

I need to compute the sum of durations of multiple events based on consecutive values in another column.

             ts             | w
----------------------------+---
 2020-07-27 15:40:04.045+00 | t
 2020-07-27 15:41:04.045+00 | t
 2020-07-27 15:41:14.045+00 | f
 2020-07-27 15:42:14.045+00 | t
 2020-07-27 15:43:14.045+00 | t
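
For reference, the sample above can be reproduced with something like this (a minimal sketch; the real schema is in the fiddle linked below):

CREATE TABLE test (
    ts timestamptz NOT NULL,  -- event timestamp
    w  boolean     NOT NULL   -- true while an event is active
);

INSERT INTO test (ts, w) VALUES
    ('2020-07-27 15:40:04.045+00', true),
    ('2020-07-27 15:41:04.045+00', true),
    ('2020-07-27 15:41:14.045+00', false),
    ('2020-07-27 15:42:14.045+00', true),
    ('2020-07-27 15:43:14.045+00', true);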

An event is considered active for as long as the column w holds consecutive true values.

The duration of the first event would be 60 seconds: '2020-07-27 15:41:04.045+00' - '2020-07-27 15:40:04.045+00'. The second event has the same duration, so the sum of both would be 120 seconds.
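
In PostgreSQL, subtracting two timestamptz values yields an interval, so the per-event arithmetic is simply:

SELECT timestamptz '2020-07-27 15:41:04.045+00'
     - timestamptz '2020-07-27 15:40:04.045+00' AS duration;
-- duration: 00:01:00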

What would be the best / most performant approach to computing these values? The longest time range we'll be looking at is about half a year, involving roughly 30 million rows.

I wrote a custom aggregate function that computes the duration, but it takes about 16 seconds for only 1.5 million rows.

 Aggregate  (cost=444090.03..444090.04 rows=1 width=4) (actual time=16290.826..16290.828 rows=1 loops=1)
   ->  Seq Scan on discriminator0  (cost=0.00..57289.03 rows=1547203 width=9) (actual time=0.016..1723.178 rows=1547229 loops=1)
 Planning Time: 0.196 ms
 JIT:
   Functions: 3
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.889 ms, Inlining 0.000 ms, Optimization 0.508 ms, Emission 6.472 ms, Total 7.870 ms
 Execution Time: 16291.836 ms

I'm new to SQL and basically just got this working through trial and error, so I'm sure there is room for improvement. Maybe even a whole different approach. Here's a fiddle with the aggregate function.

I'm not sure if I should include the code here because it's quite long.
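
The rough shape of it is something like this (a simplified sketch of the idea, not the exact code from the fiddle):

CREATE TYPE ddur_state AS (last_true timestamptz, run_start timestamptz, total interval);

CREATE FUNCTION ddur_step(s ddur_state, ts timestamptz, w boolean)
RETURNS ddur_state LANGUAGE sql IMMUTABLE AS $$
    SELECT CASE
        WHEN w AND s.run_start IS NULL
            THEN ROW(ts, ts, s.total)::ddur_state           -- a new run starts
        WHEN w
            THEN ROW(ts, s.run_start, s.total)::ddur_state  -- the run continues
        WHEN s.run_start IS NOT NULL
            THEN ROW(NULL, NULL, s.total + (s.last_true - s.run_start))::ddur_state  -- the run ends
        ELSE s                                              -- still outside a run
    END;
$$;

CREATE FUNCTION ddur_final(s ddur_state)
RETURNS interval LANGUAGE sql IMMUTABLE AS $$
    -- close a run that is still open at the end of the input
    SELECT s.total + coalesce(s.last_true - s.run_start, interval '0');
$$;

-- note: this assumes the rows arrive ordered by ts
CREATE AGGREGATE ddur(timestamptz, boolean) (
    SFUNC     = ddur_step,
    STYPE     = ddur_state,
    FINALFUNC = ddur_final,
    INITCOND  = '(,,00:00:00)'
);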

Best Answer

Here is one idea that you might want to try:

select max(ts) - min(ts)
from (
    select ts, w
         , row_number() over (order by ts)
         - row_number() over (partition by w order by ts) as grp
    from test
) as t
where w
group by grp;

The idea is to enumerate the rows in two ways: first over the complete set, then per value of w, and to calculate the difference between the two row numbers (called grp).

Whenever that difference (grp) changes, w has changed. We can therefore pick max(ts) - min(ts) within each grp, for the groups where w = true.
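
To make that concrete, running the inner query on the five sample rows from the question gives (rn = row_number over everything, rn_w = row_number per value of w):

             ts             | w | rn | rn_w | grp
----------------------------+---+----+------+-----
 2020-07-27 15:40:04.045+00 | t |  1 |    1 |   0
 2020-07-27 15:41:04.045+00 | t |  2 |    2 |   0
 2020-07-27 15:41:14.045+00 | f |  3 |    1 |   2
 2020-07-27 15:42:14.045+00 | t |  4 |    3 |   1
 2020-07-27 15:43:14.045+00 | t |  5 |    4 |   1

The two true islands end up as grp 0 and grp 1, each spanning 60 seconds.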

The problem you are trying to solve is commonly referred to as a "gaps and islands" problem. I modified your fiddle with my query: Modified fiddle

You can try it and compare the performance with your function. You probably want to add an index like:

CREATE INDEX x1 on test (ts, w);

EDIT: to calculate the total duration in seconds:

select extract(epoch from sum(dur))
from (
    select max(ts) - min(ts) as dur
    from (
        select ts, w
             , row_number() over (order by ts)
             - row_number() over (partition by w order by ts) as grp
        from test
    ) as t
    where w
    group by grp
) as tt;
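
As a side note, the total is just the sum of the gaps between consecutive rows where both the current and the previous row are true, so a single pass with lag() should give the same number with one sort instead of two (a sketch, not taken from your fiddle):

select extract(epoch from sum(dur)) as total_seconds
from (
    select ts - lag(ts) over (order by ts) as dur
         , w
         , lag(w) over (order by ts) as prev_w
    from test
) as t
where w and prev_w;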

EDIT: I played around a bit more (see Updated fiddle) and added more data (~70,000 rows, depending on the number of duplicates). I can see that the results differ between the queries; I haven't examined that more closely yet. But I'm a bit surprised that you say the queries perform about the same. Looking at the plan for your query:

Aggregate  (cost=21650.65..21650.66 rows=1 width=4) (actual time=2389.576..2389.576 rows=1 loops=1)
  ->  Seq Scan on test  (cost=0.00..1190.40 rows=81840 width=9) (actual time=0.013..36.243 rows=68709 loops=1)
Planning Time: 0.097 ms
Execution Time: 2389.663 ms

and one with window functions:

Aggregate  (cost=18951.01..18951.02 rows=1 width=8) (actual time=163.179..163.179 rows=1 loops=1)
  ->  HashAggregate  (cost=18946.01..18948.51 rows=200 width=24) (actual time=160.594..162.306 rows=8447 loops=1)
        Group Key: t.grp
        ->  Subquery Scan on t  (cost=16183.91..18639.11 rows=40920 width=16) (actual time=106.327..146.336 rows=59271 loops=1)
              Filter: t.w
              Rows Removed by Filter: 9438
              ->  WindowAgg  (cost=16183.91..17820.71 rows=81840 width=17) (actual time=106.326..140.013 rows=68709 loops=1)
                    ->  Sort  (cost=16183.91..16388.51 rows=81840 width=17) (actual time=106.316..114.565 rows=68709 loops=1)
                          Sort Key: test.ts
                          Sort Method: external merge  Disk: 2296kB
                          ->  WindowAgg  (cost=7868.76..9505.56 rows=81840 width=17) (actual time=44.285..81.613 rows=68709 loops=1)
                                ->  Sort  (cost=7868.76..8073.36 rows=81840 width=9) (actual time=44.261..52.843 rows=68709 loops=1)
                                      Sort Key: test.w, test.ts
                                      Sort Method: external merge  Disk: 1288kB
                                      ->  Seq Scan on test  (cost=0.00..1190.40 rows=81840 width=9) (actual time=0.011..5.775 rows=68709 loops=1)
Planning Time: 0.224 ms
Execution Time: 164.052 ms

The plans indicate that the latter is significantly more efficient.

I'll see if I can spin up a container with Postgres, dump the data into a table, and examine the differing results.

The index is probably not that useful since we are scanning the whole table anyhow.

Final edit:

I could reproduce the difference with only 50 rows (Fiddle), so I added the rows to a LibreOffice Calc sheet. The window-function version and LibreOffice agree on the result. For some reason, I got a negative number from your function; this may be because the first group consists of a single row. Here is the data used in the verification. Perhaps you can add a "t" row a couple of seconds before the first one to rule out errors caused by a single-row group:

2020-07-27 15:55:37 t
2020-07-27 16:01:56 f
2020-07-27 16:09:17 t
2020-07-27 16:23:32 t
2020-07-27 16:28:47 t   00:19:30
2020-07-27 18:14:32 f
2020-07-27 18:30:26 t
2020-07-27 18:40:52 t
2020-07-27 19:01:18 t
2020-07-27 19:05:24 t
2020-07-27 19:16:16 t
2020-07-27 19:56:00 t
2020-07-27 20:34:05 t   02:03:39
2020-07-27 21:12:18 f
2020-07-27 21:58:53 t
2020-07-27 23:51:55 t
2020-07-28 00:42:24 t
2020-07-28 00:43:56 t
2020-07-28 01:15:08 t
2020-07-28 03:34:08 t
2020-07-28 06:20:28 t
2020-07-28 06:23:14 t
2020-07-28 06:30:51 t
2020-07-28 06:34:12 t
2020-07-28 06:38:41 t
2020-07-28 06:48:48 t
2020-07-28 07:49:39 t
2020-07-28 08:28:41 t
2020-07-28 09:32:42 t
2020-07-28 09:45:36 t
2020-07-28 09:57:46 t
2020-07-28 10:29:37 t
2020-07-28 11:53:21 t
2020-07-28 11:53:29 t
2020-07-28 11:54:35 t
2020-07-28 12:23:18 t
2020-07-28 13:10:48 t
2020-07-28 13:37:21 t
2020-07-28 14:22:14 t
2020-07-28 14:51:16 t
2020-07-28 14:52:12 t
2020-07-28 15:30:04 t
2020-07-28 16:34:43 t
2020-07-28 17:02:03 t   19:03:10
2020-07-28 18:11:49 f
2020-07-28 18:21:09 t
2020-07-28 18:41:38 t
2020-07-28 19:14:45 t
2020-07-28 19:16:01 t
2020-07-28 19:22:03 t   01:00:54

             Total      22:27:13        80833 seconds

Final edit 2 :-)

I had a look at your function, and it appears you expect the rows to arrive ordered by ts. If so, that is a false assumption, since without an ORDER BY they may come in any order. If you do a simple test like:

SELECT ddur(ts, w) FROM (select ts, w from test order by random()) as t;

and compare with:

SELECT ddur(ts, w) FROM test;

the results are probably different; see the Modified original Fiddle.
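
If the order dependence is intentional, PostgreSQL lets you force the input order by putting an ORDER BY inside the aggregate call itself:

-- sort the input rows before they are fed to the aggregate
SELECT ddur(ts, w ORDER BY ts) FROM test;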