PostgreSQL: Generate a series of dates for each group in a table

group-by postgresql

I have a balances table in PostgreSQL 9.3 that looks like this:

CREATE TABLE balances (
  user_id INT
, balance INT
, as_of_date DATE
);

INSERT INTO balances (user_id, balance, as_of_date) VALUES
  (1, 100, '2016-01-03')
, (1,  50, '2016-01-02')
, (1,  10, '2016-01-01')
, (2, 200, '2016-01-01')
, (3,  30, '2016-01-03');

It only contains balances for dates that a user has made a transaction. I need it to contain a row for each user with their balance on each date in a given date range.

If the user does not have a row for a given date in the range, I need to use their balance from the day before.
If the user created their account after a given date in the range, I need to avoid creating a row for that user/date combination.

I can reference an accounts table to get users' create_date:

CREATE TABLE accounts (
  user_id INT
, create_date DATE
);

INSERT INTO accounts (user_id, create_date) VALUES
  (1, '2015-12-01')
, (2, '2015-12-31')
, (3, '2016-01-03');

My desired result looks like this:

+---------+---------+--------------------------+
| user_id | balance |        as_of_date        |
+---------+---------+--------------------------+
|       1 |     100 | 2016-01-03T00:00:00.000Z |
|       1 |      50 | 2016-01-02T00:00:00.000Z |
|       1 |      10 | 2016-01-01T00:00:00.000Z |
|       2 |     200 | 2016-01-03T00:00:00.000Z |
|       2 |     200 | 2016-01-02T00:00:00.000Z |
|       2 |     200 | 2016-01-01T00:00:00.000Z |
|       3 |      30 | 2016-01-03T00:00:00.000Z |
+---------+---------+--------------------------+

Note that rows have been added for user 2 for 2016-01-02 and 2016-01-03, carrying over the previous balance from 2016-01-01; and that no rows have been added for user 3 for dates before their account was created on 2016-01-03.

To generate a series of dates in a date range, I know I can use:

SELECT d.date FROM GENERATE_SERIES('2016-01-01', '2016-01-03', '1 day'::INTERVAL) d

…but I am struggling with LEFT JOINing that series with each set of rows grouped by user_id.

Best Answer

1. `CROSS JOIN`, `LEFT JOIN LATERAL` to subquery

SELECT a.user_id, COALESCE(b.balance, 0) AS balance, d.as_of_date
FROM   (
   SELECT d::date AS as_of_date  -- cast to date right away
   FROM   generate_series(timestamp '2016-01-01', '2016-01-03', interval '1 day') d
   ) d
JOIN   accounts a ON a.create_date <= d.as_of_date
LEFT   JOIN LATERAL (
   SELECT balance
   FROM   balances
   WHERE  user_id = a.user_id
   AND    as_of_date <= d.as_of_date
   ORDER  BY as_of_date DESC
   LIMIT  1
   ) b ON true
ORDER  BY a.user_id, d.as_of_date;

Returns your desired result - except that as_of_date is an actual date, not a timestamp like in your example. That should be more appropriate.

Users that have already been created but don't have any transactions yet are listed with a balance of 0. You did not define how to deal with this corner case.
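To make the per-row lookup easier to follow, here is a rough plain-Python sketch of what the query above computes. It is not PostgreSQL and not how the planner executes the query; the function name `daily_balances` and the in-memory lists are assumptions for illustration only, with the sample rows copied from the question.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Sample rows copied from the question.
balances = [
    (1, 100, date(2016, 1, 3)),
    (1, 50, date(2016, 1, 2)),
    (1, 10, date(2016, 1, 1)),
    (2, 200, date(2016, 1, 1)),
    (3, 30, date(2016, 1, 3)),
]
accounts = [
    (1, date(2015, 12, 1)),
    (2, date(2015, 12, 31)),
    (3, date(2016, 1, 3)),
]


def daily_balances(balances, accounts, start, end):
    """Return (user_id, balance, as_of_date) for every day an account exists."""
    # Per user: dates sorted ascending, with balances in the same order.
    history = {}
    for user_id, balance, as_of in sorted(balances, key=lambda r: (r[0], r[2])):
        dates, amounts = history.setdefault(user_id, ([], []))
        dates.append(as_of)
        amounts.append(balance)

    days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
    rows = []
    for user_id, create_date in sorted(accounts):
        dates, amounts = history.get(user_id, ([], []))
        for day in days:
            # Mirrors: JOIN accounts a ON a.create_date <= d.as_of_date
            if create_date > day:
                continue
            # Number of balance rows dated on or before this day.
            pos = bisect_right(dates, day)
            # Mirrors: ORDER BY as_of_date DESC LIMIT 1, then COALESCE(..., 0).
            rows.append((user_id, amounts[pos - 1] if pos else 0, day))
    return rows


result = daily_balances(balances, accounts, date(2016, 1, 1), date(2016, 1, 3))
for row in result:
    print(row)
```

The binary search over each user's sorted dates plays the role that the multicolumn index plays for the subquery: it finds the latest row at or before a given day without scanning the whole history.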

Rather use timestamp input for generate_series():

Generating time series between two dates in PostgreSQL

It's crucial for performance that you back this up with a multicolumn index:

CREATE INDEX balances_multi_idx ON balances (user_id, as_of_date DESC, balance);

We have had a very similar case on SO just this week:

Aggregating the most recent joined records per week

Find more explanation there.

2. `CROSS JOIN`, `LEFT JOIN`, window functions

SELECT user_id
     , COALESCE(max(balance) OVER (PARTITION BY user_id, grp
                                   ORDER BY as_of_date), 0) AS balance
     , as_of_date
FROM  (
   SELECT a.user_id, b.balance, d.as_of_date
        , count(b.user_id) OVER (PARTITION BY user_id ORDER BY as_of_date) AS grp
   FROM   (
      SELECT d::date AS as_of_date  -- cast to date right away
      FROM   generate_series(timestamp '2016-01-01', '2016-01-03', interval '1 day') d
      ) d
   JOIN   accounts a ON a.create_date <= d.as_of_date
   LEFT   JOIN balances b USING (user_id, as_of_date)
   ) sub
ORDER  BY user_id, as_of_date;

Same result. If you have the multicolumn index mentioned above and can get index-only scans out of it, the first solution is most probably faster.

The main feature is the running count of values to form groups. Since count() does not count NULL values, every date without a balance gets the same count, and therefore falls into the same group (grp), as the most recent date that has a balance. Each group contains at most one non-NULL balance. Then use a simple max() over the same window, with grp added to the PARTITION BY clause, to copy that balance to the gap rows that follow it.
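If the grouping trick is hard to picture, the following plain-Python sketch walks through the same two steps for a single user. The sample rows and the function name `carry_forward` are made up for illustration; the input is assumed to be already sorted by date, as the window's ORDER BY would guarantee.

```python
from itertools import groupby

# One user's joined rows in date order: (as_of_date, balance or None).
# Hypothetical sample, shaped like the output of the inner LEFT JOIN.
rows = [
    ("2016-01-01", 200),
    ("2016-01-02", None),
    ("2016-01-03", None),
    ("2016-01-04", 150),
    ("2016-01-05", None),
]


def carry_forward(rows):
    """Fill None balances with the most recent known balance, else 0."""
    # Step 1: running count of non-NULL balances,
    # like count(b.user_id) OVER (PARTITION BY user_id ORDER BY as_of_date).
    grp = 0
    tagged = []
    for as_of, balance in rows:
        if balance is not None:
            grp += 1
        tagged.append((grp, as_of, balance))

    # Step 2: each group holds at most one non-NULL balance;
    # copy it to every row of the group.
    filled = []
    for _, members in groupby(tagged, key=lambda t: t[0]):
        members = list(members)
        known = [b for _, _, b in members if b is not None]
        # max() skips NULLs in SQL; the fallback mirrors COALESCE(..., 0).
        value = max(known) if known else 0
        filled.extend((as_of, value) for _, as_of, _ in members)
    return filled


filled = carry_forward(rows)
for row in filled:
    print(row)
```

Rows that come before a user's first balance all share the count 0, so their group has no known value and they fall back to 0, just like the COALESCE in the query.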

Postgresql – Calculate rolling sum over 7 consecutive day period on PostgreSQL

By far the cleanest solution is to use window function sum with rows between:

with days as (
        SELECT date_trunc('day', d)::date as day
        FROM generate_series(CURRENT_DATE-31, CURRENT_DATE-1, '1 day'::interval) d ),
    counts as (
        select 
            days.day,
            sum((random()*5)::integer) num
        FROM days
        -- left join other tables here to get counts, I'm using random
        group by days.day
    )
select
    day,
    num,
    sum(num) over (order by day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
from counts
order by day;

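The important part is to generate the timeframe in the days CTE and join onto it, so that no day is missing even when there is no data for it. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW counts rows, not calendar days, so the frame only covers exactly 7 days if there is exactly one row per day.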
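As a rough illustration of why the gap-free day list matters, here is the same idea in plain Python. The sparse sample data and the function name `rolling_sum` are assumptions for this sketch, not part of the query above.

```python
from datetime import date, timedelta

# Hypothetical sparse data: day -> count; days without data are simply absent.
sparse = {
    date(2016, 2, 18): 2,
    date(2016, 2, 19): 5,
    date(2016, 2, 21): 4,
}


def rolling_sum(sparse, start, end, window=7):
    """Return (day, num, rolling) with one row for every day in [start, end]."""
    # Equivalent of the days CTE: one entry per calendar day, no gaps.
    days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
    # Equivalent of the LEFT JOIN: None plays the role of NULL.
    nums = [sparse.get(day) for day in days]
    out = []
    for i, day in enumerate(days):
        # Current row plus up to (window - 1) preceding rows.
        frame = [n for n in nums[max(0, i - window + 1): i + 1] if n is not None]
        # In SQL, sum() over only NULLs is NULL rather than 0.
        out.append((day, nums[i], sum(frame) if frame else None))
    return out


rows = rolling_sum(sparse, date(2016, 2, 17), date(2016, 2, 28))
for row in rows:
    print(row)
```

Without the gap-free day list, a frame of 7 rows over the three data rows alone would add up all of them no matter how far apart their dates are.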

Example

For example, if I create some test data with 20 records in the last 14 days:

SELECT (current_date - ((random()*14)::integer::text || 'days')::interval)::date as day, (random()*7)::integer as num
into test_data from generate_series(1, 20);

And also add a value before that:

insert into test_data values ((current_date - '25 days'::interval), 5);

Then use the query above:

with days as (
        SELECT date_trunc('day', d)::date as day
        FROM generate_series(CURRENT_DATE-31, CURRENT_DATE-1, '1 day'::interval) d ),
    counts as (
        select 
            days.day,
            sum(t.num) num
        FROM days
        left join test_data t on t.day = days.day
        group by days.day
    )
select
    day,
    num,
    sum(num) over (order by day rows between 6 preceding and current row)
from counts
order by day;

And get the results for the whole month:

    day     | num | sum 
------------+-----+-----
 2016-01-31 |     |    
 2016-02-01 |     |    
 2016-02-02 |     |    
 2016-02-03 |     |    
 2016-02-04 |     |    
 2016-02-05 |     |    
 2016-02-06 |   5 |   5
 2016-02-07 |     |   5
 2016-02-08 |     |   5
 2016-02-09 |     |   5
 2016-02-10 |     |   5
 2016-02-11 |     |   5
 2016-02-12 |     |   5
 2016-02-13 |     |    
 2016-02-14 |     |    
 2016-02-15 |     |    
 2016-02-16 |     |    
 2016-02-17 |     |    
 2016-02-18 |   2 |   2
 2016-02-19 |   5 |   7
 2016-02-20 |     |   7
 2016-02-21 |   4 |  11
 2016-02-22 |  15 |  26
 2016-02-23 |   1 |  27
 2016-02-24 |   1 |  28
 2016-02-25 |   2 |  28
 2016-02-26 |   4 |  27
 2016-02-27 |   9 |  36
 2016-02-28 |   5 |  37
 2016-02-29 |  11 |  33
 2016-03-01 |   5 |  37
(31 rows)

Postgresql – How to LEFT JOIN a date table onto each customer (GROUP BY) in a customer purchase table? Can a JOIN be done for dates after initial purchase

To follow on from my comment, I'd recommend a date table. The samples below use SQL Server (T-SQL) syntax: temp tables with a # prefix, GO, and @ variables.

Sample Customer Purchase Data

IF OBJECT_ID('tempdb..#CustomerPurchases') IS NOT NULL DROP TABLE #CustomerPurchases
GO
CREATE TABLE #CustomerPurchases (PurchaseDate date, CustomerName varchar(20))
INSERT INTO #CustomerPurchases (PurchaseDate, CustomerName)
VALUES
 ('2016-07-04','Jon Snow')
,('2016-07-06','Jon Snow')
,('2016-07-07','Jon Snow')
,('2016-07-07','Jon Snow')
,('2016-07-07','Jon Snow')
,('2016-07-05','Daenerys Targaryen')
,('2016-07-06','Daenerys Targaryen')
,('2016-07-09','Daenerys Targaryen')
,('2016-07-09','Daenerys Targaryen')
,('2016-07-10','Daenerys Targaryen')

Sample Date Table

IF OBJECT_ID('tempdb..#DateTable') IS NOT NULL DROP TABLE #DateTable
GO
CREATE TABLE #DateTable (DateList date)
INSERT INTO #DateTable (DateList)
VALUES
 ('2016-07-04')
,('2016-07-05')
,('2016-07-06')
,('2016-07-07')
,('2016-07-08')
,('2016-07-09')
,('2016-07-10')

You could cross join the list of dates with the distinct customers to get a full list of all customer/date combinations, then left join that to the actual sales data to return the Boolean value you're after.

SELECT
dt.DateList
,cu.CustomerName
,(CASE WHEN cp.PurchaseDate IS NULL THEN 0 ELSE 1 END) PurchaseMade
FROM #DateTable dt
CROSS JOIN  (
                SELECT DISTINCT 
                CustomerName 
                FROM #CustomerPurchases
            ) cu
LEFT JOIN   (
                SELECT DISTINCT 
                CustomerName
                ,PurchaseDate 
                FROM #CustomerPurchases
            ) cp
    ON dt.DateList = cp.PurchaseDate
    AND cu.CustomerName = cp.CustomerName

Results would look like this

DateList    CustomerName        PurchaseMade
2016-07-04  Daenerys Targaryen  0
2016-07-05  Daenerys Targaryen  1
2016-07-06  Daenerys Targaryen  1
2016-07-07  Daenerys Targaryen  0
2016-07-08  Daenerys Targaryen  0
2016-07-09  Daenerys Targaryen  1
2016-07-10  Daenerys Targaryen  1
2016-07-04  Jon Snow            1
2016-07-05  Jon Snow            0
2016-07-06  Jon Snow            1
2016-07-07  Jon Snow            1
2016-07-08  Jon Snow            0
2016-07-09  Jon Snow            0
2016-07-10  Jon Snow            0
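For readers who think more easily in procedural terms, the cross join followed by a left join can be sketched in plain Python as below. This is only an illustration of the logic using the sample data above; the variable names are assumptions and nothing here is database code.

```python
from datetime import date
from itertools import product

# Sample purchases copied from above: (PurchaseDate, CustomerName).
purchases = [
    (date(2016, 7, 4), "Jon Snow"),
    (date(2016, 7, 6), "Jon Snow"),
    (date(2016, 7, 7), "Jon Snow"),
    (date(2016, 7, 7), "Jon Snow"),
    (date(2016, 7, 7), "Jon Snow"),
    (date(2016, 7, 5), "Daenerys Targaryen"),
    (date(2016, 7, 6), "Daenerys Targaryen"),
    (date(2016, 7, 9), "Daenerys Targaryen"),
    (date(2016, 7, 9), "Daenerys Targaryen"),
    (date(2016, 7, 10), "Daenerys Targaryen"),
]
# The date table: 2016-07-04 through 2016-07-10.
date_list = [date(2016, 7, d) for d in range(4, 11)]

# SELECT DISTINCT CustomerName
customers = sorted({name for _, name in purchases})
# SELECT DISTINCT CustomerName, PurchaseDate
purchased = set(purchases)

# CROSS JOIN dates with customers, then flag whether a matching purchase exists.
grid = [
    (day, name, 1 if (day, name) in purchased else 0)
    for name, day in product(customers, date_list)
]

for row in grid:
    print(row)
```

Deduplicating the purchases first matters for the same reason as the SELECT DISTINCT in the query: otherwise the three purchases on 2016-07-07 would produce three output rows for that day.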

If you were to do this, your date table would obviously cover a much wider range than the one-week example I've used. You could then just use date parameters to restrict the output to the date range you're after. Something like this:

DECLARE @StartDate date; SET @StartDate = '2016-07-05'
DECLARE @EndDate date; SET @EndDate = '2016-07-08'
SELECT
dt.DateList
,cu.CustomerName
,(CASE WHEN cp.PurchaseDate IS NULL THEN 0 ELSE 1 END) PurchaseMade
FROM #DateTable dt
CROSS JOIN  (
                SELECT DISTINCT 
                CustomerName 
                FROM #CustomerPurchases
            ) cu
LEFT JOIN   (
                SELECT DISTINCT 
                CustomerName
                ,PurchaseDate 
                FROM #CustomerPurchases
            ) cp
    ON dt.DateList = cp.PurchaseDate
    AND cu.CustomerName = cp.CustomerName
WHERE dt.DateList BETWEEN @StartDate AND @EndDate
ORDER BY CustomerName, DateList

Which would give these results

DateList    CustomerName        PurchaseMade
2016-07-05  Daenerys Targaryen  1
2016-07-06  Daenerys Targaryen  1
2016-07-07  Daenerys Targaryen  0
2016-07-08  Daenerys Targaryen  0
2016-07-05  Jon Snow            0
2016-07-06  Jon Snow            1
2016-07-07  Jon Snow            1
2016-07-08  Jon Snow            0
