PostgreSQL – Group by with a break on date

group by, postgresql, window functions

I am trying to group data into homogeneous periods of time, that is, runs of consecutive periods during which a value stays the same.
Here is a simplified example of my data:

CREATE TABLE workhours_over_time (
             worker varchar,
             workhours integer,
             otherinfotobeignored varchar,
             startofperiod date,
             endofperiod date);

INSERT INTO workhours_over_time (worker, workhours, 
            otherinfotobeignored, startofperiod, endofperiod)
VALUES ('W1', '35', 'any info', '2020-01-01','2020-01-31');

INSERT INTO workhours_over_time (worker, workhours, 
            otherinfotobeignored, startofperiod, endofperiod)
VALUES ('W1', '35', 'any other info', '2020-02-01','2020-02-28');

INSERT INTO workhours_over_time (worker, workhours,
            otherinfotobeignored, startofperiod, endofperiod)
VALUES ('W1', '39', 'any info', '2020-03-01','2020-04-15');

INSERT INTO workhours_over_time (worker, workhours,
            otherinfotobeignored, startofperiod, endofperiod)
VALUES ('W1', '35', 'any info', '2020-04-16','2111-11-11');

This gives the following table contents:

worker|workhours|otherinfotobeignored|startofperiod|endofperiod|
------|---------|--------------------|-------------|-----------|
W1    |       35|any info            |   2020-01-01| 2020-01-31|
W1    |       35|any other info      |   2020-02-01| 2020-02-28|
W1    |       39|any info            |   2020-03-01| 2020-04-15|
W1    |       35|any info            |   2020-04-16| 2111-11-11|

The result I need would group the data into 3 periods: from 2020-01-01 to 2020-02-28, worker W1 works 35 hours a week. Then, from 2020-03-01 to 2020-04-15, W1 works 39 hours a week. Then, from 2020-04-16 to the end of time, W1 is back to 35 hours.

worker|workhours|startofperiod|endofperiod|
------|---------|-------------|-----------|
W1    |       35|   2020-01-01| 2020-02-28|
W1    |       39|   2020-03-01| 2020-04-15|
W1    |       35|   2020-04-16| 2111-11-11|

My first, naive attempt was a simple GROUP BY:

select worker, workhours, min(startofperiod), max(endofperiod)
from workhours_over_time
group by worker, workhours

but of course, the first and third periods were grouped into a single 2020-01-01 to 2111-11-11 period, which is not what I want.

Then I toyed with window functions and read about gaps and islands, but didn't find the solution…

Any ideas?

Best Answer

testdb=# create table t(k text, h integer, s date, e date);
CREATE TABLE
testdb=# insert into t select 'w1', 35, '2020-01-01', '2020-01-31';
INSERT 0 1
testdb=# insert into t select 'w1', 35, '2020-02-01', '2020-02-28';
INSERT 0 1
testdb=# insert into t select 'w1', 39, '2020-03-01', '2020-04-15';
INSERT 0 1
testdb=# insert into t select 'w1', 35, '2020-04-16', '2021-12-31';
INSERT 0 1
testdb=# select * from t;
 k  | h  |     s      |     e      
----+----+------------+------------
 w1 | 35 | 2020-01-01 | 2020-01-31
 w1 | 35 | 2020-02-01 | 2020-02-28
 w1 | 39 | 2020-03-01 | 2020-04-15
 w1 | 35 | 2020-04-16 | 2021-12-31
(4 rows)

testdb=# select row_number() over (partition by k order by s) r1,
                row_number() over (partition by k, h order by s) r2,
                t.* from t;
 r1 | r2 | k  | h  |     s      |     e      
----+----+----+----+------------+------------
  1 |  1 | w1 | 35 | 2020-01-01 | 2020-01-31
  2 |  2 | w1 | 35 | 2020-02-01 | 2020-02-28
  3 |  1 | w1 | 39 | 2020-03-01 | 2020-04-15
  4 |  3 | w1 | 35 | 2020-04-16 | 2021-12-31

testdb=# select (h, row_number() over (partition by k order by s) -
    row_number() over (partition by k, h order by s)) as group_key, t.* from t;
 group_key | k  | h  |     s      |     e      
-----------+----+----+------------+------------
 (35,0)    | w1 | 35 | 2020-01-01 | 2020-01-31
 (35,0)    | w1 | 35 | 2020-02-01 | 2020-02-28
 (39,2)    | w1 | 39 | 2020-03-01 | 2020-04-15
 (35,1)    | w1 | 35 | 2020-04-16 | 2021-12-31
(4 rows)

testdb=# with q0 as (
    select (h, row_number() over (partition by k order by s)
              -row_number() over (partition by k, h order by s)
              ) AS group_key,
           t.* from t)
select k, h,
       min(s) as s,
       max(e) as e
from q0 group by group_key, k, h order by min(s);
 k  | h  |     s      |     e
----+----+------------+------------
 w1 | 35 | 2020-01-01 | 2020-02-28
 w1 | 39 | 2020-03-01 | 2020-04-15
 w1 | 35 | 2020-04-16 | 2021-12-31

Why does this work? In the first select with row_number, partitioning by both k (the worker column in your question, shortened for brevity) and h gives an independently increasing sequence for each h value of each worker (r2). The other sequence (r1) is the worker's overall row number, regardless of h. You then take the difference between the two (r1 - r2). For consecutive rows with the same h value, both sequences increase by 1, so the difference is the same for both rows. As soon as a row with a different h value comes in between, r1 advances while that h value's r2 does not, so the difference changes. Combining h with that difference gives a key that identifies each contiguous set of rows with the same hours worked for each worker; that's the group_key column in the next query.

At that point, it's a simple GROUP BY on group_key (together with k and h), taking the min of the start and the max of the end. And that's it.
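
For reference, here is a sketch of the same technique written against the table and column names from the question rather than the shortened t, k, h, s, e. The CTE name numbered and the column alias island are made up for this illustration; everything else comes from the question's CREATE TABLE.

```sql
-- Row-number difference ("gaps and islands") on the question's table.
WITH numbered AS (
    SELECT worker,
           workhours,
           startofperiod,
           endofperiod,
           -- overall sequence per worker minus sequence per (worker, workhours):
           -- constant within each run of consecutive rows with equal workhours
           row_number() OVER (PARTITION BY worker ORDER BY startofperiod)
         - row_number() OVER (PARTITION BY worker, workhours ORDER BY startofperiod)
           AS island
    FROM   workhours_over_time
)
SELECT worker,
       workhours,
       min(startofperiod) AS startofperiod,
       max(endofperiod)   AS endofperiod
FROM   numbered
GROUP  BY worker, workhours, island
ORDER  BY worker, min(startofperiod);

-- With the four sample rows from the question this returns:
--  W1 | 35 | 2020-01-01 | 2020-02-28
--  W1 | 39 | 2020-03-01 | 2020-04-15
--  W1 | 35 | 2020-04-16 | 2111-11-11
```

Note that this only looks at the order of the rows, not at the dates themselves: two consecutive rows with the same workhours are merged even if there is a gap between one row's endofperiod and the next row's startofperiod. If such gaps can occur in your data and should break a group, the query needs an additional check.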

Related Solutions

How to Improve Union for Consistent Results in PostgreSQL

Your query can be simplified in various respects:

SELECT to_char(day, 'DD/MM/YYYY') AS date
     , hashtag
     , count(d.*)::int AS count
FROM  (
       campaign c
CROSS  JOIN json_array_elements_text(c.configuration#>'{query,hashtags}') ch(hashtag)
CROSS  JOIN (SELECT g::date AS day
             FROM generate_series(timestamp '2016-01-20', '2016-01-26', interval '1 day') g) day 
      )
NATURAL LEFT JOIN (
   SELECT "createdAt"::date AS day, dh.hashtag
   FROM   data, json_array_elements_text(meta#>'{matchedOn,hashtags}') dh(hashtag)
   WHERE  "campaignId" = 1
   AND    "createdAt" >= '2016-01-20'
   AND    "createdAt" <  '2016-01-27'
   ) d
WHERE  c.id = 1
GROUP  BY day, hashtag
ORDER  BY day, hashtag, count;

This should be faster for multiple reasons. Not least, it can use a multicolumn index on data("campaignId", "createdAt"), which you should create if you don't have it yet.
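
A minimal sketch of such an index follows. The index name data_campaign_created_idx is an arbitrary choice for this example; the table and column names are the ones used in the query above.

```sql
-- Multicolumn index matching the filter: equality on "campaignId",
-- range on "createdAt". IF NOT EXISTS requires PostgreSQL 9.5 or later.
CREATE INDEX IF NOT EXISTS data_campaign_created_idx
    ON data ("campaignId", "createdAt");
```

The equality column comes first and the range column second, which lets the planner use a single contiguous slice of the index for the WHERE clause in the subquery.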

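LEFT JOIN is the core feature you need. Then count the joined row (count(d.*) above) instead of using count(*): only non-null values count, so a day/hashtag combination without a match gets a count of 0.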
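
The difference is easiest to see on a toy example. The following self-contained sketch uses made-up inline data (the aliases d, h, day and tag are only for illustration) and compares counting a column from the right side of the LEFT JOIN with count(*):

```sql
-- Two days on the left side, matching rows only for the first day.
SELECT d.day,
       count(h.tag) AS matches,   -- NULLs from the outer join are not counted
       count(*)     AS all_rows   -- counts the NULL-extended row as well
FROM  (VALUES (date '2016-01-20'),
              (date '2016-01-21')) AS d(day)
LEFT   JOIN (VALUES (date '2016-01-20', 'a'),
                    (date '2016-01-20', 'b')) AS h(day, tag)
       ON h.day = d.day
GROUP  BY d.day
ORDER  BY d.day;

--     day     | matches | all_rows
-- 2016-01-20  |       2 |        2
-- 2016-01-21  |       0 |        1
```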

One of the very rare cases where a NATURAL JOIN is useful, btw. You don't need it, though. It's just a minor syntax shortcut.

And rather use timestamp input for generate_series(). See:

Generating time series between two dates in PostgreSQL
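
As a small illustration of that advice, here is the day series from the query above written on its own with explicit timestamp bounds. This is only a sketch of the call; the date range is the one already used in the query.

```sql
-- Both bounds are typed as timestamp, so the timestamp variant of
-- generate_series() is used and the result does not depend on the
-- session's time zone setting. The cast to date happens afterwards.
SELECT g::date AS day
FROM   generate_series(timestamp '2016-01-20',
                       timestamp '2016-01-26',
                       interval '1 day') AS g;

-- Returns 7 rows: 2016-01-20 through 2016-01-26.
```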

How to Select Last Non-NULL Values per Column in SQL Server

Using the concatenation technique from "The Last non NULL Puzzle" by Itzik Ben-Gan, the query would look like this with your sample table and column data types:

select T.cat,
       cast(substring(
                     max(cast(T.t as binary(2)) + cast(T.val1 as binary(4))),
                     3,
                     4
                     ) as int),
       cast(substring(
                     max(cast(T.t as binary(2)) + cast(T.val2 as binary(1))),
                     3,
                     1
                     ) as char(1))
from #tab as T
group by T.cat;

Here is another way to write this query, dividing the steps into CTEs to perhaps better show what is going on. It gives the exact same execution plan as the query above.

with C1 as
(
  -- Concatenate the ordering column with the value column
  select T.cat,
        cast(T.t as binary(2)) + cast(T.val1 as binary(4)) as val1,
        cast(T.t as binary(2)) + cast(T.val2 as binary(1)) as val2
  from #tab as T
),
C2 as
(
  -- Get the max concatenated value per group
  select C1.cat,
         max(C1.val1) as val1,
         max(C1.val2) as val2
  from C1
  group by C1.cat
)
-- Extract the value from the concatenated column
select C2.cat,
       cast(substring(C2.val1, 3, 4) as int) as val1,
       cast(substring(C2.val2, 3, 1) as char(1)) as val2
from C2;

This solution uses the fact that concatenating a null value with something results in a null value. See SET CONCAT_NULL_YIELDS_NULL (Transact-SQL).
