How to join sparse events for individual users on a “filled in” date ranges in Redshift

redshift

Let's say I have a table that has a record of every change a customer has made to their subscription for an internet service:

+-----+------------+-----------+------------+
| _id | account_id | plan_name | created_at |
+-----+------------+-----------+------------+
|   1 | a          | basic     | 10-01-19   |
|   2 | b          | standard  | 10-15-19   |
|   3 | c          | free      | 10-16-19   |
|   4 | a          | standard  | 10-16-19   |
|   5 | b          | free      | 10-20-19   |
+-----+------------+-----------+------------+

I'd like to perform an analysis on this data so that I can see the count of plans for every day, from 10-01-19 through 10-20-19.

To perform this analysis (among other things) I'd like to create a query that will give me a table with a row for every date for every user from their first recorded plan_name through present day (let's say this is 10-20-19). It might look something like this:

+-----+----------+------+-----------+
| _id |   date   | user | plan_name |
+-----+----------+------+-----------+
|   1 | 10-1-19  | a    | basic     |
|   2 | 10-2-19  | a    | basic     |
|   3 | 10-3-19  | a    | basic     |
|   4 | ...      | ...  | ...       |
|   5 | 10-15-19 | a    | standard  |
|   6 | 10-16-19 | a    | standard  |
|   7 | 10-17-19 | a    | standard  |
|   8 | 10-15-19 | b    | standard  |
|   9 | 10-16-19 | b    | standard  |
|  10 | 10-17-19 | b    | standard  |
|  11 | ...      | ...  | ...       |
|  12 | 10-20-19 | b    | free      |
+-----+----------+------+-----------+

My first test attempt at this was the following:

WITH dates AS (
        SELECT (GETDATE()::DATE - ROW_NUMBER() OVER (ORDER BY TRUE))::date AS date
      FROM pg_catalog.pg_operator
      LIMIT 500)

SELECT date,
     (case when account_id is not null then account_id else lag(account_id ignore nulls) over (order by date) end) as account_id,
     (case when plan_name is not null then plan_name else lag(plan_name ignore nulls) over (order by date) end) as plan_name,
     (case when start_date is not null then start_date else lag(start_date ignore nulls) over (order by date) end) as start_date
 FROM dates
LEFT JOIN (
    select * from customer_subscriptions
    where account_id = '98asd7gf98a7') AS a
    on dates.date = trunc(a.start_date)
ORDER BY date DESC

With the caveat that I tried it on a single account_id from my subscriptions table — and it worked! My thought process was generate a date range and LEFT JOIN the subscriptions table on it at the "change over" dates and then use a lag function to fill in the NULLs

However, when i remove the where account_id = '98asd7gf98a7' it no longer works and I am left with a table that simply is a joining the start dates to a date from the range, without any of the lag filler in between.

Any thoughts on another way to approach this or changes I can make to my current approach?

Best Answer

Non-optimized solution:

WITH RECURSIVE
cte1 AS ( SELECT MIN(created_at) "date" 
          FROM test
          UNION ALL
          SELECT ("date" + INTERVAL '1 DAY')::DATE 
          FROM cte1 
          WHERE "date" < ( SELECT MAX(created_at)
                           FROM test ) ),
cte2 AS ( SELECT DISTINCT account_id
          FROM test ),
cte3 AS ( SELECT cte1."date", cte2.account_id, MAX(test.created_at) created_at
          FROM cte1
          CROSS JOIN cte2
          LEFT JOIN test ON cte1."date" >= test.created_at
                        AND cte2.account_id = test.account_id
          GROUP BY cte1."date", cte2.account_id )
SELECT cte3."date", cte3.account_id "user", test.plan_name
FROM cte3 
LEFT JOIN test ON cte3.account_id = test.account_id
              AND cte3.created_at = test.created_at
ORDER BY 2, 1

fiddle