PostgreSQL – How to Select Rolling Most Recent Value for Each Person for Each Month

postgresql, redshift, window functions

I need help writing a SQL query for Redshift (Postgres will do) likely involving WINDOW functions, PARTITIONS, LAST_VALUE and other things above my head.

Users can submit survey responses at any time (usually every few months). After 6 months the score is no longer fresh/relevant.

DROP TABLE IF EXISTS users;
CREATE TABLE users (
    user_id       INTEGER UNIQUE,
    user_name     VARCHAR(50)
);
INSERT INTO users VALUES
    (1, 'Steve Wozniak'),
    (2, 'Steve Jobs'),
    (3, 'Tony Ive');

DROP TABLE IF EXISTS responses;
CREATE TABLE responses (
    response_id     INTEGER UNIQUE,
    user_id         INTEGER,
    date            DATE,      
    score           INTEGER
);
INSERT INTO responses VALUES
    (1, 1, '2016-08-21', 2),
    (2, 1, '2016-02-04', 8),
    (3, 1, '2016-04-11', 4),
    (4, 1, '2016-06-21', 10),
    (5, 2, '2015-11-04', 9),
    (6, 2, '2015-11-22', 8),
    (7, 2, '2016-07-11', 10),
    (8, 2, '2016-08-15', 2);

I would like to return a record set grouped by month that contains the rolling most recent score (LAST_VALUE) for each user up to that month, provided that response is no more than 6 months old.

The result set would contain the following rows relating to User #1. NULL values shown for clarity but can be omitted:

+---------+------------+-------+
| User Id | Date       | Score |
+---------+------------+-------+
| 1       | 2015-11-01 | NULL  | <= No score submitted yet
| 1       | 2015-12-01 | NULL  | <= No score submitted yet
| 1       | 2016-01-01 | NULL  | <= No score submitted yet 
| 1       | 2016-02-01 | NULL  | <= No score submitted yet
| 1       | 2016-03-01 | 8     |
| 1       | 2016-04-01 | 8     |
| 1       | 2016-05-01 | 4     |
| 1       | 2016-06-01 | 4     |
| 1       | 2016-07-01 | 10    |
| 1       | 2016-08-01 | 10    |
| 1       | 2016-09-01 | 2     |
| 1       | 2016-10-01 | 2     |
+---------+------------+-------+

And for #2:

+---------+------------+-------+
| User Id | Date       | Score |
+---------+------------+-------+
| 2       | 2015-11-01 | NULL  | <= No score submitted yet
| 2       | 2015-12-01 | 8     |
| 2       | 2016-01-01 | 8     |  
| 2       | 2016-02-01 | 8     |
| 2       | 2016-03-01 | 8     |
| 2       | 2016-04-01 | 8     |
| 2       | 2016-05-01 | 8     |
| 2       | 2016-06-01 | NULL  | <= 2015-11-22 SCORE OLDER THAN 6 MONTHS
| 2       | 2016-07-01 | NULL  | <= 2015-11-22 SCORE OLDER THAN 6 MONTHS
| 2       | 2016-08-01 | 10    | 
| 2       | 2016-09-01 | 2     |
| 2       | 2016-10-01 | 2     |
+---------+------------+-------+

All 12 months (or the entire series) should be populated unless all values are NULL. This could possibly be done with generate_series() in Postgres or with a numbers table in Redshift (https://www.periscopedata.com/blog/generate-series-in-redshift-and-mysql.html).
The NULL values can be omitted (shown for clarity).

Ultimately I'm going to need to reproduce the same result grouped by year and week of the year, but I can probably extrapolate those queries if I can figure this out.

Best Answer

2015-11-22 is not older than 6 months compared to 2016-05-01 unless you also truncate the response date to the month.

> select '2015-11-22'::date + interval '6 months';
+---------------------+
| ?column?            |
|---------------------|
| 2016-05-22 00:00:00 |
+---------------------+

> select date_trunc('month', '2015-11-22'::date) + interval '6 months';
+---------------------------+
| ?column?                  |
|---------------------------|
| 2016-05-01 00:00:00+02:00 |
+---------------------------+

So the query changes a bit depending on what you consider expired. I've included both versions: the proper 6-month calculation is commented out, and the active condition follows the example.
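To make the difference concrete, here is a minimal Python sketch (an illustration only, not part of the queries) of the two expiry rules applied to the 2015-11-22 response. The helper names add_months, fresh_exact and fresh_truncated are made up for this example, and add_months only approximates Postgres interval arithmetic by clamping the day to the end of the target month.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) + months, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def fresh_exact(response: date, tick: date) -> bool:
    # Mirrors: responses.date between tick - interval '6 months' and tick
    return add_months(tick, -6) <= response <= tick


def fresh_truncated(response: date, tick: date) -> bool:
    # Mirrors: responses.date < tick
    #          and date_trunc('month', responses.date) + interval '6 months' > tick
    return response < tick and add_months(response.replace(day=1), 6) > tick


response = date(2015, 11, 22)
for tick in (date(2016, 4, 1), date(2016, 5, 1), date(2016, 6, 1)):
    # Columns: month start, fresh under exact rule, fresh under truncated rule
    print(tick, fresh_exact(response, tick), fresh_truncated(response, tick))
```

For the month starting 2016-05-01 the two rules disagree: the exact rule still treats the 2015-11-22 score as fresh, while the truncated rule does not.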

Using a lateral join:

with users as (
    -- I assume you have a users table, so this CTE can be omitted
    -- first and last are used to limit the join
    select user_id, min(date) as first, max(date) as last from responses group by user_id
), boundaries as (
    select 
        date_trunc('month', min(date)) as low 
        --, date_trunc('month', max(date)) as high 
        -- If you want to use the high in generate_series as upper boundary
    from responses
)
select user_id, tick as "date", score 
from 
    generate_series(
        (select low from boundaries), 
        date_trunc('month', current_date + interval '1 month'), interval '1 month'
    ) as tick
join users on tick.tick between users.first and users.last + interval '6 months'
-- # If you omit the users CTE and you have a users table
-- cross join users
left outer join lateral (
    select score from responses 
    where users.user_id = responses.user_id 
    -- # Proper 6 months calculation
    --and responses.date between tick.tick - interval '6 months' and tick.tick 
    -- # As example showed
    and responses.date < tick.tick and date_trunc('month', responses.date) + interval '6 months' > tick.tick 
    order by responses.date desc limit 1
) a on true
order by 1, 2;

Using window functions:

with boundaries as (
    select 
        date_trunc('month', min(date)) as low
    from responses
)
select distinct responses.user_id, tick as "date", first_value(score) over (partition by responses.user_id, tick.tick order by responses.date desc)
from 
    generate_series(
        (select low from boundaries), 
        date_trunc('month', current_date + interval '1 month'), interval '1 month'
    ) as tick
join responses on (
    -- # Proper 6 months calculation
    -- responses.date between tick.tick - interval '6 months' and tick.tick 
    -- # As it was in the example
    responses.date < tick.tick and 
    date_trunc('month', responses.date) > tick.tick - interval '6 months'
) 
order by 1, 2;
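
On small data, the output of either query can be cross-checked against a plain Python reference implementation. The sketch below is a stand-in built on assumptions, not a Redshift/Postgres solution: it hard-codes the sample responses from the question, emits a row for every user and month (so NULL months appear as None), and the names rolling_scores and shift_month_start are hypothetical.

```python
from datetime import date

# Sample rows from the question: (user_id, response date, score)
RESPONSES = [
    (1, date(2016, 8, 21), 2), (1, date(2016, 2, 4), 8),
    (1, date(2016, 4, 11), 4), (1, date(2016, 6, 21), 10),
    (2, date(2015, 11, 4), 9), (2, date(2015, 11, 22), 8),
    (2, date(2016, 7, 11), 10), (2, date(2016, 8, 15), 2),
]


def shift_month_start(d, months):
    """First day of the month that lies `months` months away from d's month."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) + months, 12)
    return date(year, month0 + 1, 1)


def rolling_scores(responses, first_tick, last_tick, truncate_to_month=True):
    """Return (user_id, tick, score) rows; ticks must be month starts."""
    rows = []
    for user in sorted({u for u, _, _ in responses}):
        tick = first_tick
        while tick <= last_tick:
            if truncate_to_month:
                # responses.date < tick
                # and date_trunc('month', responses.date) + 6 months > tick
                fresh = [(d, s) for u, d, s in responses
                         if u == user and d < tick
                         and shift_month_start(d, 6) > tick]
            else:
                # responses.date between tick - 6 months and tick
                # (tick is a month start, so shift_month_start is exact here)
                fresh = [(d, s) for u, d, s in responses
                         if u == user
                         and shift_month_start(tick, -6) <= d <= tick]
            # Most recent fresh response wins; None plays the role of NULL
            score = max(fresh, key=lambda r: r[0])[1] if fresh else None
            rows.append((user, tick, score))
            tick = shift_month_start(tick, 1)
    return rows


for row in rolling_scores(RESPONSES, date(2015, 11, 1), date(2016, 10, 1)):
    print(row)
```

With the truncated rule, user 2 has no fresh score for 2016-05-01. Passing truncate_to_month=False keeps the score 8 for that month and drops it from 2016-06-01 onward.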

Related Solutions

MySQL/PostgreSQL – How to Select One Most Recent Post for Each Author

If your aim is to have queries with maximum efficiency, none of the above queries is really the best, or at least not always.

Efficiency depends on many different things: the specific DBMS, the specific version (different versions have different optimizer improvements and available syntax), the column types, the available indexes, the size of the tables and the distribution of values, the hardware the server runs on, the configuration settings, etc.

You should always test various different ways of writing the queries, on your tables, with the sizes and distribution you expect to have on production, with your hardware and configuration settings, to decide which rewritings of the queries should be kept.

This specific kind of query is often called greatest-n-per-group (there is even a tag for it!). Under certain assumptions, one of the many ways to write it is often quite efficient in both MySQL and PostgreSQL: it uses a LATERAL join in Postgres, available in version 9.3 and later (CROSS/OUTER APPLY in SQL Server lingo), and a simulation of this join in MySQL.

The assumptions are that the number of authors (the attribute we group by) is small compared to the number of posts (the table the grouping is applied to). It's also best if there is an index or a table to find all the distinct author_id values, and an additional index on the posts table for the group by.

This solution to the greatest-n-per-group problem also meets your request about ties, as it always returns exactly one result per group. If you want to be precise about which of the tied rows is returned, modify the ORDER BY in the subquery (for example, to ORDER BY pi.date DESC, pi.id DESC or ORDER BY pi.date DESC, a.name).

Query in PostgreSQL:

SELECT p.* 
FROM authors AS a
   , LATERAL 
       ( SELECT pi.*
         FROM posts AS pi
         WHERE pi.author_id = a.author_id
         ORDER BY pi.date DESC
         LIMIT 1
       ) AS p ;

Query in MySQL:

SELECT p.* 
FROM authors AS a
  JOIN posts AS p
    ON p.id =
       ( SELECT pi.id
         FROM posts AS pi
         WHERE pi.author_id = a.author_id
         ORDER BY pi.date DESC
         LIMIT 1
       ) ;

The useful index is on posts (author_id, date, id) for MySQL and on posts (author_id, date DESC) for Postgres.

It bears repeating: before using any of the above, test them in your environment and compare them against the many other versions/rewritings of the query. In Postgres, for example, the DISTINCT ON syntax can be used in versions older than 9.3. The resulting query is more compact than the LATERAL one and might be more efficient under different data distributions. Query:

SELECT DISTINCT ON (author_id) p.*
FROM posts AS p
ORDER BY p.author_id,
         p.date DESC ;
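
When comparing rewritings, it can help to have an independent check of what "one most recent post per author, with a deterministic tie-break" should return on a small sample. The following Python sketch is only such a check, under assumptions: the rows are invented, the tuple layout (id, author_id, date) is hypothetical, and ties on date are broken by the highest id, matching ORDER BY pi.date DESC, pi.id DESC.

```python
from datetime import date

# Hypothetical sample rows: (id, author_id, date)
posts = [
    (1, 10, date(2016, 1, 5)),
    (2, 10, date(2016, 3, 9)),
    (3, 10, date(2016, 3, 9)),   # same date as id 2, so the tie-break matters
    (4, 20, date(2015, 12, 1)),
]


def latest_per_author(rows):
    """Keep one row per author: newest date, ties broken by highest id."""
    best = {}
    for row in rows:
        post_id, author_id, posted = row
        current = best.get(author_id)
        # Compare (date, id) pairs so the ordering is fully deterministic
        if current is None or (posted, post_id) > (current[2], current[0]):
            best[author_id] = row
    return best


for author_id, row in sorted(latest_per_author(posts).items()):
    print(author_id, row)
```

Each SQL variant, run on the same rows with the same tie-break, should return the same ids as this dictionary.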

PostgreSQL – How to Select Specific Value from Table with Multiple Values for Primary Key

It looks like you need to define your user categories with CASE and PIVOT the result:

select organization
     , count(*) filter (where rgt=59) "Reviewer" 
     , count(*) filter (where rgt=66) "Data Entry" 
     , count(*) filter (where rgt=46) "Read Only" 
from( select users.*
           , case when exists( select *
                               from user_rights_rel 
                               where uid=id and rid in(59,32) ) then 59
                  when exists( select * 
                               from user_rights_rel 
                               where uid=id and rid=66 ) then 66
                  when exists( select *
                               from user_rights_rel 
                               where uid=id and rid=64 ) then 46 end rgt
      from users ) u 
group by organization
order by organization;

organization | Reviewer | Data Entry | Read Only
:----------- | -------: | ---------: | --------:
A            |        2 |          2 |         1
B            |        1 |          1 |         3
