Following the comment from FrustratedWithFormsDesigner, I arrived at the following solution:
SELECT subq2.*, sum(new_group) OVER (ORDER BY t ASC) AS group_id
FROM (
    SELECT subq.*, CASE WHEN delta > 1500 THEN 1 ELSE 0 END AS new_group
    FROM (
        SELECT t, lag(t) OVER (ORDER BY t ASC),
               t - lag(t) OVER (ORDER BY t ASC) AS delta
        FROM time_points
    ) AS subq
) AS subq2
The running sum of the new_group flags assigns each cluster of rows its own group_id.
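As a minimal sketch of this gaps-and-islands technique, assuming a hypothetical time_points table where t is a numeric timestamp: the inner query flags any row whose gap from the previous row exceeds 1500, and the running sum turns those flags into group IDs.

```sql
-- Hypothetical sample data; t is a numeric (epoch-style) timestamp.
WITH time_points(t) AS (
    VALUES (100), (700), (1200), (3500), (4000)
)
SELECT subq2.*, sum(new_group) OVER (ORDER BY t) AS group_id
FROM (
    SELECT subq.*, CASE WHEN delta > 1500 THEN 1 ELSE 0 END AS new_group
    FROM (
        SELECT t,
               lag(t) OVER (ORDER BY t) AS prev_t,
               t - lag(t) OVER (ORDER BY t) AS delta
        FROM time_points
    ) AS subq
) AS subq2;
-- 100, 700, 1200 land in group 0; the 2300-second gap before 3500
-- sets new_group = 1, so 3500 and 4000 land in group 1.
```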
Thank you very much!
I like tsrange. It's certainly not the only way to do this, but it's not error-prone and it's easy to read and write.
SELECT
grp.range,
sum(
EXTRACT(epoch FROM least(upper(grp.range),datahora_fim))
- EXTRACT(epoch FROM greatest(lower(grp.range),datahora_ini))
)
FROM (
SELECT
date_trunc('hour', min(datahora_ini)),
date_trunc('hour', max(datahora_fim))
FROM login
) AS bounds(min,max)
CROSS JOIN LATERAL generate_series(min, max, '1 hour') AS gs(start)
CROSS JOIN LATERAL tsrange(gs.start, gs.start + '1 hour') AS grp(range)
JOIN login ON grp.range && tsrange(datahora_ini,datahora_fim)
GROUP BY range
ORDER BY range;
range | sum
-----------------------------------------------+-------
["2017-06-02 08:00:00","2017-06-02 09:00:00") | 17821
["2017-06-02 09:00:00","2017-06-02 10:00:00") | 18000
["2017-06-02 10:00:00","2017-06-02 11:00:00") | 18000
["2017-06-02 11:00:00","2017-06-02 12:00:00") | 17079
["2017-06-02 12:00:00","2017-06-02 13:00:00") | 14400
["2017-06-02 13:00:00","2017-06-02 14:00:00") | 14363
["2017-06-02 14:00:00","2017-06-02 15:00:00") | 3716
["2017-06-02 15:00:00","2017-06-02 16:00:00") | 3600
["2017-06-02 16:00:00","2017-06-02 17:00:00") | 833
(9 rows)
The first part generates the hourly ranges for the data:
SELECT bounds.*, grp.*
FROM (
SELECT
date_trunc('hour', min(datahora_ini)),
date_trunc('hour', max(datahora_fim))
FROM login
) AS bounds(min,max)
CROSS JOIN LATERAL generate_series(min, max, '1 hour') AS gs(start)
CROSS JOIN LATERAL tsrange(gs.start, gs.start + '1 hour') AS grp(range)
ORDER BY range;
min | max | range
---------------------+---------------------+-----------------------------------------------
2017-06-02 08:00:00 | 2017-06-02 17:00:00 | ["2017-06-02 08:00:00","2017-06-02 09:00:00")
2017-06-02 08:00:00 | 2017-06-02 17:00:00 | ["2017-06-02 09:00:00","2017-06-02 10:00:00")
2017-06-02 08:00:00 | 2017-06-02 17:00:00 | ["2017-06-02 10:00:00","2017-06-02 11:00:00")
2017-06-02 08:00:00 | 2017-06-02 17:00:00 | ["2017-06-02 11:00:00","2017-06-02 12:00:00")
2017-06-02 08:00:00 | 2017-06-02 17:00:00 | ["2017-06-02 12:00:00","2017-06-02 13:00:00")
2017-06-02 08:00:00 | 2017-06-02 17:00:00 | ["2017-06-02 13:00:00","2017-06-02 14:00:00")
2017-06-02 08:00:00 | 2017-06-02 17:00:00 | ["2017-06-02 14:00:00","2017-06-02 15:00:00")
2017-06-02 08:00:00 | 2017-06-02 17:00:00 | ["2017-06-02 15:00:00","2017-06-02 16:00:00")
2017-06-02 08:00:00 | 2017-06-02 17:00:00 | ["2017-06-02 16:00:00","2017-06-02 17:00:00")
(9 rows)
The second part joins them back to the original data, and:
- Pulls the seconds since epoch of the lesser of the range's upper bound and datahora_fim; the upper bound of the range is the ceiling.
- Pulls the seconds since epoch of the greater of the range's lower bound and datahora_ini; the lower bound of the range is the floor.
- Subtracts the two to get the overlap in seconds.
- Sums it up per range.
That looks like this:
sum(
EXTRACT(epoch FROM least(upper(grp.range),datahora_fim))
- EXTRACT(epoch FROM greatest(lower(grp.range),datahora_ini))
)
This method can use an expression (functional) index on tsrange(datahora_ini, datahora_fim).
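A sketch of such an index (the index name and access method are assumptions; a GiST index is what supports the && overlap operator used in the join):

```sql
-- Hypothetical expression index so the overlap join can use it.
CREATE INDEX login_period_gist_idx
    ON login
    USING gist (tsrange(datahora_ini, datahora_fim));
```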
Your original query shows only the lower bound of the tsrange; if you prefer that output, just use lower(grp.range).
Best Answer
There are a number of ways to do this, but if you only need a monthly period starting on a given date, I would use a variable for the day of the month to start on and just calculate a new column with that shifted month. Something like:
Apologies for this being T-SQL syntax (I'm not a PL/SQL guy), but in case you need help on declaring variables: http://www.postgresqltutorial.com/plpgsql-variables/
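A minimal PostgreSQL sketch of that idea, shifting each timestamp back so months "start" on an arbitrary day; the function name, the pivot day (15), and the table/column names are all assumptions, not from the original answer:

```sql
-- Hypothetical helper: map a timestamp to the custom month it belongs to,
-- where months begin on start_day instead of the 1st.
CREATE FUNCTION custom_month(ts timestamp, start_day int)
RETURNS date
LANGUAGE sql IMMUTABLE AS $$
    SELECT date_trunc('month', ts - make_interval(days => start_day - 1))::date;
$$;

-- Usage sketch: group logins by months running from the 15th to the 14th.
SELECT custom_month(datahora_ini, 15) AS period, count(*)
FROM login
GROUP BY period
ORDER BY period;
```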