There's probably something more elegant than this, but since you are doing a kind of pivot (transforming from columns to rows, i.e. metadata to data) it is never going to be pretty. Assuming a table like:
create table t
( ts timestamp not null
, EngOilP_sd smallint not null
, CompOilLVL_sd smallint not null
) engine = innodb;
insert into t (ts, EngOilP_sd, CompOilLVL_sd)
values
('2015-06-24 20:28:07',0 ,0 )
,('2015-06-24 20:30:20',1 ,0 )
,('2015-06-24 20:36:47',1 ,0 )
,('2015-06-24 20:41:11',1 ,0 )
,('2015-06-24 20:43:29',1 ,0 )
,('2015-06-24 20:45:42',0 ,0 )
,('2015-06-24 20:47:51',0 ,0 )
,('2015-06-24 20:49:59',0 ,1 )
,('2015-06-24 20:52:01',0 ,1 )
,('2015-06-24 20:54:17',0 ,0 );
Since we are going to reference the "pivot" several times, we create a view for it:
create view t_pivot as
select 'EngOilP_sd' as sd, ts, EngOilP_sd as val from t
union all
select 'CompOilLVL_sd', ts, CompOilLVL_sd from t;
The start time (using @Paul's comment as the definition) is the earliest time where each attribute is 1:
select sd, min(case val when 1 then ts end) as start
from t_pivot
group by sd
The stop time is the earliest time after that where each attribute is 0:
select x.sd, x.start, min(case y.val when 0 then y.ts end) as stop
from (
select sd, min(case val when 1 then ts end) as start
from t_pivot
group by sd
) as x
join t_pivot as y
on x.sd = y.sd
and y.ts > x.start
group by x.sd, x.start
Adding another level of nesting makes it convenient to determine the duration:
select sd, timediff(stop, start) as duration, start, stop
from (
select x.sd, x.start, min(case y.val when 0 then y.ts end) as stop
from (
select sd, min(case val when 1 then ts end) as start
from t_pivot
group by sd
) as x
join t_pivot as y
on x.sd = y.sd
and y.ts > x.start
group by x.sd, x.start
) as z
order by sd desc;
+---------------+----------+---------------------+---------------------+
| sd | duration | start | stop |
+---------------+----------+---------------------+---------------------+
| EngOilP_sd | 00:15:22 | 2015-06-24 20:30:20 | 2015-06-24 20:45:42 |
| CompOilLVL_sd | 00:04:18 | 2015-06-24 20:49:59 | 2015-06-24 20:54:17 |
+---------------+----------+---------------------+---------------------+
2 rows in set (0.00 sec)
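As a cross-check of the logic (not part of the original answer), the same start/stop definition can be expressed in a few lines of Python over the sample rows; the helper names `start_stop` and `parse` are mine:

```python
from datetime import datetime

# Sample rows from table t above: (ts, EngOilP_sd, CompOilLVL_sd)
rows = [
    ("2015-06-24 20:28:07", 0, 0), ("2015-06-24 20:30:20", 1, 0),
    ("2015-06-24 20:36:47", 1, 0), ("2015-06-24 20:41:11", 1, 0),
    ("2015-06-24 20:43:29", 1, 0), ("2015-06-24 20:45:42", 0, 0),
    ("2015-06-24 20:47:51", 0, 0), ("2015-06-24 20:49:59", 0, 1),
    ("2015-06-24 20:52:01", 0, 1), ("2015-06-24 20:54:17", 0, 0),
]

def parse(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

def start_stop(values):
    # start = earliest ts where the attribute is 1,
    # stop  = earliest ts after start where the attribute is 0
    start = min(ts for ts, v in values if v == 1)
    stop = min(ts for ts, v in values if v == 0 and ts > start)
    return start, stop

for name, col in (("EngOilP_sd", 1), ("CompOilLVL_sd", 2)):
    start, stop = start_stop([(parse(r[0]), r[col]) for r in rows])
    print(name, stop - start, start, stop)
```

This prints durations of 0:15:22 and 0:04:18, matching the SQL result set above.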
I'm still somewhat confused about what you want, but I think this is it:
SELECT client, provider, sender, min(created_at) AS first_created_at
FROM (
SELECT count(CASE WHEN is_reset THEN 1 END) OVER (ORDER BY created_at) AS grp,
client,
provider,
sender,
created_at
FROM (
SELECT
(client,provider,sender) <> lag( (client,provider,sender) ) OVER (ORDER BY created_at) AS is_reset,
client,
provider,
sender,
created_at
FROM messages
) AS t1
) AS t2
GROUP BY grp, client, provider, sender
HAVING count(*) > 1;
Breaking that down, we first create an is_reset flag
using PostgreSQL's row-comparison feature (we could have written this in other ways, e.g. by comparing the columns individually):
SELECT
(client,provider,sender) <> lag( (client,provider,sender) ) OVER (ORDER BY created_at) AS is_reset,
client,
provider,
sender,
created_at
FROM messages
is_reset | client | provider | sender | created_at
----------+--------+----------+--------+------------------------
| 1 | 2 | 1 | 2017-06-13 12:00:00-05
f | 1 | 2 | 1 | 2017-06-13 12:00:10-05
f | 1 | 2 | 1 | 2017-06-13 12:01:00-05
t | 3 | 2 | 2 | 2017-06-13 12:05:00-05
t | 1 | 3 | 1 | 2017-06-13 12:05:00-05
f | 1 | 3 | 1 | 2017-06-13 12:05:10-05
t | 3 | 2 | 2 | 2017-06-13 12:05:10-05
f | 3 | 2 | 2 | 2017-06-13 12:05:20-05
t | 1 | 2 | 2 | 2017-06-13 12:10:00-05
f | 1 | 2 | 2 | 2017-06-13 12:10:10-05
f | 1 | 2 | 2 | 2017-06-13 12:10:20-05
t | 1 | 2 | 1 | 2017-06-13 12:11:00-05
t | 1 | 2 | 2 | 2017-06-13 12:11:10-05
t | 3 | 2 | 3 | 2017-06-13 12:12:00-05
t | 3 | 2 | 2 | 2017-06-13 12:12:10-05
f | 3 | 2 | 2 | 2017-06-13 12:12:15-05
t | 3 | 2 | 3 | 2017-06-13 12:12:30-05
f | 3 | 2 | 3 | 2017-06-13 12:14:00-05
(18 rows)
Then we take a running count()
of those resets to get groups.
SELECT count(CASE WHEN is_reset THEN 1 END) OVER (ORDER BY created_at) AS grp,
client,
provider,
sender,
created_at
FROM (
SELECT
(client,provider,sender) <> lag( (client,provider,sender) ) OVER (ORDER BY created_at) AS is_reset,
client,
provider,
sender,
created_at
FROM messages
) AS t1;
grp | client | provider | sender | created_at
-----+--------+----------+--------+------------------------
0 | 1 | 2 | 1 | 2017-06-13 12:00:00-05
0 | 1 | 2 | 1 | 2017-06-13 12:00:10-05
0 | 1 | 2 | 1 | 2017-06-13 12:01:00-05
2 | 3 | 2 | 2 | 2017-06-13 12:05:00-05
2 | 1 | 3 | 1 | 2017-06-13 12:05:00-05
3 | 1 | 3 | 1 | 2017-06-13 12:05:10-05
3 | 3 | 2 | 2 | 2017-06-13 12:05:10-05
3 | 3 | 2 | 2 | 2017-06-13 12:05:20-05
4 | 1 | 2 | 2 | 2017-06-13 12:10:00-05
4 | 1 | 2 | 2 | 2017-06-13 12:10:10-05
4 | 1 | 2 | 2 | 2017-06-13 12:10:20-05
5 | 1 | 2 | 1 | 2017-06-13 12:11:00-05
6 | 1 | 2 | 2 | 2017-06-13 12:11:10-05
7 | 3 | 2 | 3 | 2017-06-13 12:12:00-05
8 | 3 | 2 | 2 | 2017-06-13 12:12:10-05
8 | 3 | 2 | 2 | 2017-06-13 12:12:15-05
9 | 3 | 2 | 3 | 2017-06-13 12:12:30-05
9 | 3 | 2 | 3 | 2017-06-13 12:14:00-05
(18 rows)
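Notice that both 12:05:00 rows share grp = 2: the default window frame for count(...) OVER (ORDER BY created_at) is RANGE, which includes peer rows with the same created_at. To make that concrete, here is a small Python re-computation of grp over the rows above (my own sketch, not code from the question):

```python
# Rows in the displayed order: (client, provider, sender, created_at)
rows = [
    (1, 2, 1, "12:00:00"), (1, 2, 1, "12:00:10"), (1, 2, 1, "12:01:00"),
    (3, 2, 2, "12:05:00"), (1, 3, 1, "12:05:00"), (1, 3, 1, "12:05:10"),
    (3, 2, 2, "12:05:10"), (3, 2, 2, "12:05:20"), (1, 2, 2, "12:10:00"),
    (1, 2, 2, "12:10:10"), (1, 2, 2, "12:10:20"), (1, 2, 1, "12:11:00"),
    (1, 2, 2, "12:11:10"), (3, 2, 3, "12:12:00"), (3, 2, 2, "12:12:10"),
    (3, 2, 2, "12:12:15"), (3, 2, 3, "12:12:30"), (3, 2, 3, "12:14:00"),
]

# is_reset: does (client, provider, sender) differ from the previous row's?
# lag() has no previous row for the first row, hence None (SQL NULL).
is_reset = [None] + [rows[i][:3] != rows[i - 1][:3] for i in range(1, len(rows))]

# grp: running count of resets.  The RANGE frame means rows that tie on
# created_at are peers and share the same count -- so we count every reset
# whose created_at is <= the current row's, not just the preceding rows.
grp = [sum(1 for r, f in zip(rows, is_reset) if f and r[3] <= row[3])
       for row in rows]

print(grp)
```

The output matches the grp column above, including the tied rows.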
And now we GROUP BY grp
and make sure there are at least two messages (with HAVING
), which I assume takes care of your "compare the time from the first message of a group of messages of client to next message of the provider".
I'm not sure of your other criteria or what you want. You did a great job showing data; perhaps you should continue the trend and show what you want to exclude from your data. Also, please, in the future, draw out a desired result from that data if you can.
Best Answer
So the trick here is a property of two equally incrementing series: their difference can be used to identify islands, e.g. {11,12,13} - {1,2,3} = {10,10,10}. This property isn't enough to identify islands in and of itself, but it's a crucial step that we can exploit to do so.
Background
Stepping aside from the problem for a moment, let's check this out. Here is some code.
This looks pretty good, and grouping by the difference would be good enough in this example. You can see we have three groups starting at (2,42), (13,7), and (42,2), corresponding to the groups in xoffsets. This is essentially the problem in reverse. But we have one major issue, because we're demonstrating this with static offsets: if the difference between any two offsets o1 - o2 is the same, we'll have a problem. Well, we have to find a way to define the second offset statically.
And, again, we're back on track to making groups for each pair of offsets. This isn't quite what we're doing, but it's pretty close, and hopefully it serves to illustrate how the two sets can be subtracted to create islands.
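As an illustration only (the run lengths are made up; the offset values mirror the three groups mentioned above), the sequence-subtraction idea looks like this in Python:

```python
from itertools import groupby

# Three runs of repeated offsets, mirroring the groups (2,42), (13,7), (42,2)
xoffsets = [(2, 42)] * 3 + [(13, 7)] * 4 + [(42, 2)] * 2

# A global counter 1,2,3,... minus a per-value counter is constant within a
# run: {11,12,13} - {1,2,3} = {10,10,10}
seen = {}
keys = []
for i, x in enumerate(xoffsets, start=1):
    seen[x] = seen.get(x, 0) + 1        # times this value has been seen
    keys.append((x, i - seen[x]))       # (value, diff) identifies the island

islands = [list(g) for _, g in
           groupby(range(len(xoffsets)), key=lambda j: keys[j])]
print([len(isl) for isl in islands])    # sizes of the three islands
```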
Application
Now let's revisit the problem above with table foo. We sandwich the variables between two copies of x for display purposes only. You can see here all of the variables at play, in addition to x and id.
lhs is simple. We're just generating a unique sequential identifier with that (because we're feeding it a unique sequential identifier: id -- though never forget that ids are seldom gapless).
rhs is slightly more complex. We partition by x and generate a sequential identifier amongst the different x values. Observe how rhs increments over the set each time it sees a row with a value that it has already seen. The property of rhs is how many times the value was seen.
diff is the simple result of the subtraction, but it's not too useful to think of it like that. Think of it more like subtracting a sequence than subtracting digits (though they're digits for any single row). We have a sequence increasing by one for each time a value is seen, and a sequence increasing by one for every distinct id (every time). Subtracting these two sequences will return the same number for repeating values, just like in our example above: (11,12,13) - (1,2,3) = (10,10,10). This is the same principle as in the first section of this answer. diff doesn't independently mark the group by itself; diff forces all identical islands of x into a group, which may have false positives (e.g., the three cases where diff=3 above and their corresponding x values).
The group grp is a function of (x, diff). It serves as the grouping id, albeit in a slightly weird format. This serves to reduce the false positives that would happen if we just grouped by diff.
Simple Unoptimized Query
So now we have our simple unoptimized query.
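A minimal Python sketch of the lhs/rhs/diff mechanics described above (the tiny x sequence is mine, chosen to show a diff collision between two different values):

```python
from itertools import groupby

x = ["A", "B", "A"]           # two islands of A around one island of B

lhs, rhs, diff = [], [], []
seen = {}
for i, v in enumerate(x, start=1):
    seen[v] = seen.get(v, 0) + 1
    lhs.append(i)             # sequential id over the whole set
    rhs.append(seen[v])       # sequential id within the partition of this x
    diff.append(i - seen[v])

print(diff)                   # diff alone has a false positive here:
                              # row 2 (B) and row 3 (A) share diff = 1

grp = list(zip(x, diff))      # (x, diff) is the actual grouping id
islands = [k for k, _ in groupby(grp)]
print(islands)                # three islands, as desired
```

Grouping by diff alone would merge the lone B row with the second A row; grouping by (x, diff) keeps them apart, which is exactly the false-positive reduction described above.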
Optimizations
On the matter of replacing dense_rank() with something else, like row_number(), @ypercube commented. So let's review it. Here is a query that shows row_number() and dense_rank() over an ordering column with the values 1,2,3,4,5,6,7,8,6,7,8, and two different "islands" of x vals.
SQL Query,
Result set, where dr is dense_rank and rn is row_number.
You'll see here that when the column you're ordering by has duplicates, you can't use this method, because you can't ensure an ordering for a query that has duplicates in the ORDER BY clause. It's very much the effect of the ordering of the duplicates that throws off the difference: the relationship of the global incrementing value to the incrementing value over the partition. However, when you have a unique id column, or a series of columns that define uniqueness, by all means use row_number() instead of dense_rank().
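To see concretely why duplicates break the row_number() variant, here is a Python sketch (my own, not from the answer) of both ranking functions over the values 1,2,3,4,5,6,7,8,6,7,8: dense_rank() gives tied values one shared rank, while row_number() must invent an arbitrary order among them, so the global-minus-partition difference is no longer deterministic.

```python
vals = [1, 2, 3, 4, 5, 6, 7, 8, 6, 7, 8]
s = sorted(vals)              # the ORDER BY: ties at 6, 7 and 8

# row_number(): 1,2,3,... regardless of ties; the order among tied rows
# is an arbitrary choice the database is free to make differently each run
rn = list(range(1, len(s) + 1))

# dense_rank(): tied values share a rank; the next distinct value gets rank + 1
dr, rank, prev = [], 0, object()
for v in s:
    if v != prev:
        rank += 1
        prev = v
    dr.append(rank)

print(dr)   # stable under any tie-break order
print(rn)   # depends on which tied row "comes first"
```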