I am trying to group data on homogenous periods of time (not sure this is clear).
Here is a simplified example of my data:
CREATE TABLE workhours_over_time (
worker varchar,
workhours integer,
otherinfotobeignored varchar,
startofperiod date,
endofperiod date);
INSERT INTO workhours_over_time (worker, workhours,
otherinfotobeignored, startofperiod, endofperiod)
VALUES ('W1', '35', 'any info', '2020-01-01','2020-01-31');
INSERT INTO workhours_over_time (worker, workhours,
otherinfotobeignored, startofperiod, endofperiod)
VALUES ('W1', '35', 'any other info', '2020-02-01','2020-02-28');
INSERT INTO workhours_over_time (worker, workhours,
otherinfotobeignored, startofperiod, endofperiod)
VALUES ('W1', '39', 'any info', '2020-03-01','2020-04-15');
INSERT INTO workhours_over_time (worker, workhours,
otherinfotobeignored, startofperiod, endofperiod)
VALUES ('W1', '35', 'any info', '2020-04-16','2111-11-11');
So here is the result:
worker|workhours|otherinfotobeignored|startofperiod|endofperiod|
------|---------|--------------------|-------------|-----------|
W1 | 35|any info | 2020-01-01| 2020-01-31|
W1 | 35|any other info | 2020-02-01| 2020-02-28|
W1 | 39|any info | 2020-03-01| 2020-04-15|
W1 | 35|any info | 2020-04-16| 2111-11-11|
The result I need to obtain would group data in 3 periods: from 2020-01-01 to 2020-02-28, worker W1 works 35 hours a week. Then from 2020-03-01 to 2020-04-15 W1 works 39 hours a week. Then from 2020-04-16 to end of time, back to 35 hours.
worker|workhours|startofperiod|endofperiod|
------|---------|-------------|-----------|
W1 | 35| 2020-01-01| 2020-02-28|
W1 | 39| 2020-03-01| 2020-04-15|
W1 | 35| 2020-04-16| 2111-11-11|
I had first naive try with a simple group by:
select worker, workhours, min(startofperiod), max(endofperiod)
from workhours_over_time
group by worker, workhours
but of course, first and third period where grouped into a 2020-01-01 to 2111-11-11 period, which is not what I want.
Then I have toyed with window function, read on groups and island, but didn't find the solution…
Any idea ?
Best Answer
Why does this work? In the first select with row_number, partitioning by by both k (the worker column in your question, just shortened for simplicity/brevity) and h gives independently-increasing sequences for each h value for each worker. You then just subtract the worker's overall row number sequence. Consecutive rows whose h values are the same will have both sequences increasing by 1, so subtracting non-hours-worked-scoped row number will give the same value for both rows. Using both h and that subtracted value gives us a key that identifies each contiguous set of rows for each hours-worked for each worker; that's the group_key column in the next query.
At that point, it's a simple GROUP BY of group_key, and min/max the start and end. And that's it.