PostgreSQL: Generate a series of dates for each group in a table

group bypostgresql

I have a balances table in PostgreSQL 9.3 that looks like this:

CREATE TABLE balances (
  user_id INT
, balance INT
, as_of_date DATE
);

INSERT INTO balances (user_id, balance, as_of_date) VALUES
  (1, 100, '2016-01-03')
, (1,  50, '2016-01-02')
, (1,  10, '2016-01-01')
, (2, 200, '2016-01-01')
, (3,  30, '2016-01-03');

It only contains balances for dates that a user has made a transaction. I need it to contain a row for each user with their balance on each date in a given date range.

  • If the user does not have a row for a given date in the range, I need to use their balance from the day before.
  • If the user created their account after a given date in the range, I need to avoid creating a row for that user/date combination.

I can reference an accounts table to get users' create_date:

CREATE TABLE accounts (
  user_id INT
, create_date DATE
);

INSERT INTO accounts (user_id, create_date) VALUES
  (1, '2015-12-01')
, (2, '2015-12-31')
, (3, '2016-01-03');

My desired result looks like this:

+---------+---------+--------------------------+
| user_id | balance |        as_of_date        |
+---------+---------+--------------------------+
|       1 |     100 | 2016-01-03T00:00:00.000Z |
|       1 |      50 | 2016-01-02T00:00:00.000Z |
|       1 |      10 | 2016-01-01T00:00:00.000Z |
|       2 |     200 | 2016-01-03T00:00:00.000Z |
|       2 |     200 | 2016-01-02T00:00:00.000Z |
|       2 |     200 | 2016-01-01T00:00:00.000Z |
|       3 |      30 | 2016-01-03T00:00:00.000Z |
+---------+---------+--------------------------+

Note that rows have been added for user 2 for 2016-01-02 and 2016-01-03, carrying over the previous balance from 2016-01-01; and that no rows have been added for user 3, who was created on 2016-01-03.

To generate a series of dates in a date range, I know I can use:

SELECT d.date FROM GENERATE_SERIES('2016-01-01', '2016-01-03', '1 day'::INTERVAL) d

…but I am struggling with LEFT JOINing that series with each set of rows grouped by user_id.

Best Answer

1. CROSS JOIN, LEFT JOIN LATERAL to subquery

SELECT a.user_id, COALESCE(b.balance, 0) AS balance, d.as_of_date
FROM   (
   SELECT d::date AS as_of_date  -- cast to date right away
   FROM   generate_series(timestamp '2016-01-01', '2016-01-03', interval '1 day') d
   ) d
JOIN   accounts a ON a.create_date <= d.as_of_date
LEFT   JOIN LATERAL (
   SELECT balance
   FROM   balances
   WHERE  user_id = a.user_id
   AND    as_of_date <= d.as_of_date
   ORDER  BY as_of_date DESC
   LIMIT  1
   ) b ON true
ORDER  BY a.user_id, d.as_of_date;

Returns your desired result - except that as_of_date is an actual date, not a timestamp like in your example. That should be more appropriate.

Users that are created already, but don't have any transactions, yet, are listed with a balance of 0. You did not define how to deal with the corner case.

Rather use timestamp input for generate_series():

It's crucial for performance that you back this up with a multicolumn index:

CREATE INDEX balances_multi_idx ON balances (user_id, as_of_date DESC, balance);

We have had a very similar case on SO just this week:

Find more explanation there.

2. CROSS JOIN, LEFT JOIN , window functions

SELECT user_id
     , COALESCE(max(balance) OVER (PARTITION BY user_id, grp
                                   ORDER BY as_of_date), 0) AS balance
     , as_of_date
FROM  (
   SELECT a.user_id, b.balance, d.as_of_date
        , count(b.user_id) OVER (PARTITION BY user_id ORDER BY as_of_date) AS grp
   FROM   (
      SELECT d::date AS as_of_date  -- cast to date right away
      FROM   generate_series(timestamp '2016-01-01', '2016-01-03', interval '1 day') d
      ) d
   JOIN   accounts a ON a.create_date <= d.as_of_date
   LEFT   JOIN balances b USING (user_id, as_of_date)
   ) sub
ORDER  BY user_id, as_of_date;

Same result. If you have the multicolumn index mentioned above and can get index-only scans out of it, the first solution is most probably faster.

The main feature is the running count of values to form groups. since count() does not count NULL values, all dates without balance fall into the same group (grp) as the most recent balance. Then use a simple max() over the same window frame extended by grp to copy the last balance for dangling gaps.

Related: