I need to compute the sum of durations of multiple events based on consecutive values in another column.
ts | w
----------------------------+---
2020-07-27 15:40:04.045+00 | t
2020-07-27 15:41:04.045+00 | t
2020-07-27 15:41:14.045+00 | f
2020-07-27 15:42:14.045+00 | t
2020-07-27 15:43:14.045+00 | t
An event is considered active for as long as the column w holds consecutive values of true.
The duration of the first event would be 60 seconds: '2020-07-27 15:41:04.045+00' - '2020-07-27 15:40:04.045+00'. The second event has the same duration, so the sum of both would be 120 seconds.
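In PostgreSQL, subtracting two timestamptz values yields an interval, so the figure for a single event can be checked directly (a quick sanity check, not part of the final query):

```sql
-- Subtracting two timestamptz values yields an interval: here 00:01:00 (60 seconds)
SELECT '2020-07-27 15:41:04.045+00'::timestamptz
     - '2020-07-27 15:40:04.045+00'::timestamptz AS duration;
```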
What would be the best/most performant approach to computing these values? The longest time range we'll be looking at will probably be half a year involving about 30 million rows.
I wrote a custom aggregate function that computes the duration but it takes about 16 seconds for only 1.5 million rows.
Aggregate (cost=444090.03..444090.04 rows=1 width=4) (actual time=16290.826..16290.828 rows=1 loops=1)
-> Seq Scan on discriminator0 (cost=0.00..57289.03 rows=1547203 width=9) (actual time=0.016..1723.178 rows=1547229 loops=1)
Planning Time: 0.196 ms
JIT:
Functions: 3
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.889 ms, Inlining 0.000 ms, Optimization 0.508 ms, Emission 6.472 ms, Total 7.870 ms
Execution Time: 16291.836 ms
I'm new to SQL and basically just got this working through trial and error, so I'm sure there is room for improvement. Maybe even a whole different approach. Here's a fiddle with the aggregate function.
I'm not sure if I should include the code here because it's quite long.
Best Answer
Here is one idea that you might want to try:
The idea is to enumerate ts in two ways: first over the complete set, then per value of w, and take the difference between the two row numbers (called grp). Whenever grp changes, w has changed, so we can pick max(ts) - min(ts) within each grp for the groups where w = true.
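A minimal sketch of that query, assuming the table name discriminator0 (as in the plan in the question) and the columns ts and w from the sample data; the exact query is in the modified fiddle linked below:

```sql
-- Gaps-and-islands sketch: two row_number() enumerations whose difference (grp)
-- is constant within each run of identical w values.
WITH numbered AS (
    SELECT ts, w,
           row_number() OVER (ORDER BY ts)
         - row_number() OVER (PARTITION BY w ORDER BY ts) AS grp
    FROM discriminator0
)
SELECT grp, max(ts) - min(ts) AS duration
FROM numbered
WHERE w
GROUP BY grp
ORDER BY grp;
```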
The problem you are trying to solve is often referred to as a "gaps and islands" problem. I modified your fiddle with my query: Modified fiddle
You can try and compare the performance with your function. You probably want to add an index like:
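Something along these lines, assuming the table and column names from the question (the exact definition in the fiddle may differ):

```sql
-- One possibility, assuming the table/column names from the question;
-- the exact index in the fiddle may differ.
CREATE INDEX ON discriminator0 (ts, w);
```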
EDIT: to calculate the total duration in seconds:
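A sketch of how that can be done, building on the query above: aggregate the per-island durations and convert the interval to seconds with extract(epoch from ...):

```sql
-- Sum the per-island durations and convert the resulting interval to seconds
WITH numbered AS (
    SELECT ts, w,
           row_number() OVER (ORDER BY ts)
         - row_number() OVER (PARTITION BY w ORDER BY ts) AS grp
    FROM discriminator0
), islands AS (
    SELECT max(ts) - min(ts) AS duration
    FROM numbered
    WHERE w
    GROUP BY grp
)
SELECT sum(extract(epoch FROM duration)) AS total_seconds
FROM islands;
```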
EDIT: I played around a bit more, see Updated fiddle, and added a bit more data (~70,000 rows, depending on the number of duplicates). I can see that the results differ between the queries; I haven't examined that more closely yet. But I'm a bit surprised that you say the queries perform about the same. Looking at the plan for your query:
and one with window functions:
indicates that the latter is significantly more efficient.
I'll see if I can spin up a container with Postgres, dump the data into a table, and examine the differing results.
The index is probably not that useful since we are scanning the whole table anyhow.
Final edit:
I could reproduce the difference with only 50 rows (Fiddle), so I added the rows to a LibreOffice Calc sheet. It appears that the window version and LibreOffice agree on the result. For some reason, I got a negative number from your function; this may be because the first row is a single row. This is the data used in the verification; perhaps you can add a "t" row a couple of seconds before the first one to rule out errors caused by a single row:
Final edit 2:-)
I had a look at your function, and it appears that you expect the rows to arrive ordered by ts? If so, that is a false assumption, since they may come in any order. If you do a simple test like:
and compare with:
the results are probably different. See the Modified original Fiddle.
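A sketch of such a test; duration_agg is a hypothetical stand-in for the custom aggregate from the question (the real name and signature are in the fiddle):

```sql
-- duration_agg(ts, w) is a hypothetical stand-in for the custom aggregate.
-- If the aggregate silently assumes ts-ordered input, these two calls,
-- which feed the rows in opposite orders, will disagree.
SELECT duration_agg(ts, w ORDER BY ts)      FROM discriminator0;
SELECT duration_agg(ts, w ORDER BY ts DESC) FROM discriminator0;
```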