I need help writing a SQL query for Redshift (Postgres will do) likely involving WINDOW functions, PARTITIONS, LAST_VALUE and other things above my head.
Users can submit survey responses at any time (usually every few months). After 6 months the score is no longer fresh/relevant.
DROP TABLE IF EXISTS users;
CREATE TABLE users (
user_id INTEGER UNIQUE,
user_name VARCHAR(50)
);
INSERT INTO users VALUES
(1, 'Steve Wozniak'),
(2, 'Steve Jobs'),
(3, 'Tony Ive');
DROP TABLE IF EXISTS responses;
CREATE TABLE responses (
response_id INTEGER UNIQUE,
user_id INTEGER,
date DATE,
score INTEGER
);
INSERT INTO responses VALUES
(1, 1, '2016-08-21', 2),
(2, 1, '2016-02-04', 8),
(3, 1, '2016-04-11', 4),
(4, 1, '2016-06-21', 10),
(5, 2, '2015-11-04', 9),
(6, 2, '2015-11-22', 8),
(7, 2, '2016-07-11', 10),
(8, 2, '2016-08-15', 2);
I would like to return a recordset grouped by month that contains the rolling, most recent score (LAST_VALUE) for each user up to that month if that response that is no older than 6 months old.
The result set would contain the following rows relating to User #1. NULL values shown for clarity but can be omitted:
+---------+------------+-------+
| User Id | Date | Score |
+---------+------------+-------+
| 1 | 2015-11-01 | NULL | <= No score submitted yet
| 1 | 2015-12-01 | NULL | <= No score submitted yet
| 1 | 2016-01-01 | NULL | <= No score submitted yet
| 1 | 2016-02-01 | NULL | <= No score submitted yet
| 1 | 2016-03-01 | 8 |
| 1 | 2016-04-01 | 8 |
| 1 | 2016-05-01 | 4 |
| 1 | 2016-06-01 | 4 |
| 1 | 2016-07-01 | 10 |
| 1 | 2016-08-01 | 10 |
| 1 | 2016-09-01 | 2 |
| 1 | 2016-10-01 | 2 |
+---------+------------+-------+
And for #2:
+---------+------------+-------+
| User Id | Date | Score |
+---------+------------+-------+
| 2 | 2015-11-01 | NULL | <= No score submitted yet
| 2 | 2015-12-01 | 8 |
| 2 | 2016-01-01 | 8 |
| 2 | 2016-02-01 | 8 |
| 2 | 2016-03-01 | 8 |
| 2 | 2016-04-01 | 8 |
| 2 | 2016-05-01 | 8 |
| 2 | 2016-06-01 | NULL | <= 2015-11-22 SCORE OLDER THAN 6 MONTHS
| 2 | 2016-07-01 | NULL | <= 2015-11-22 SCORE OLDER THAN 6 MONTHS
| 2 | 2016-08-01 | 10 |
| 2 | 2016-09-01 | 2 |
| 2 | 2016-10-01 | 2 |
+---------+------------+-------+
-
All 12 months (or entire series) should be populated unless all NULL. Possibly use generate_series() in postgres or using a number table in Redshift (https://www.periscopedata.com/blog/generate-series-in-redshift-and-mysql.html)
-
The NULL values can be omitted (shown for clarity).
Ultimately I'm going to need to reproduce the same grouped by year and week of the year but I can probably extrapolate those queries if I can figure out this.
Best Answer
2015-11-22 is not older than 6 months compared to 2015-05-01 unless you truncate the response date to month also.
So depending the logic what you consider expired the query changes a bit. I've included both version. In comment the date and active condition as per the example.
Using Lateral join:
Using Window functions: