Postgresql – Time-series window query for calculating rolling average across group by subquery is inaccurate due to missing rows

aggregatepostgresqlwindow functions

I'm trying to generate a rolling average based on a generated time series joined with a subquery on each day of the time series. However, the subquery occasionally does not have a result for a given day, and hence it does not return a row for said day. This breaks the rolling average calculation since it's based on preceding/following rows. It still calculates, but the calculation is wrong since it's not counting "0" days since those rows aren't present.

SELECT
    d::date AS day,
    updates.login as login,
    avg(coalesce(updates.counts, 0)) over (order by date(d) rows between 3 preceding and 3 following) as rollingavg
from generate_series(date '${from}' - 3, date '${to}' + 3, interval '1' DAY) AS t(d)
LEFT JOIN (
    SELECT
        date(item.updated_at) AS day,
        u.login,
        count(date(item.updated_at)) AS counts
    FROM
        user_item as item
        inner join user u ON u.id = item.user_id
    WHERE
        item.updated_at >= (date '${from}' - 3)
        item.updated_at <= (date '${to}' + 3)
    GROUP by day, login 
) updates ON updates.day = t.d

If I remove the login from the group by clause and modify the query to fit, it appears to work, but that's only because there is enough data that no day goes without at least one update. In the off chance there were a day without an update, this would experience the same problem.

My initial thought was to join a generated table that is essentially a cross join between the time series and every possible author, or possibly modify my from generate_series to represent such a table that includes a baseline '0' count for every user on every day, but something about that feels wrong.

Am I on the right path?

EDIT: Some simplified sample data to make it a bit more obvious what I'm getting at. This is focused purely on the subquery since that is where the problem begins:

User:

id login
1 userA
2 userB

UserItem:

updated_at user_id
'2020-01-01' 1
'2020-01-01' 2
'2020-01-02' 1
'2020-01-03' 2

The desired result is as follows:

date user count
'2020-01-01 userA 1
'2020-01-01 userB 1
'2020-01-02 userA 1
'2020-01-02 userB 0
'2020-01-03 userA 0
'2020-01-03 userB 1

But what I'm getting:

date user count
'2020-01-01 userA 1
'2020-01-01 userB 1
'2020-01-02 userA 1
'2020-01-03 userB 1

Best Answer

As Lennart mentioned, you'll need to CROSS JOIN your generated series with your users to create the required combinations of (Login,Date):

SELECT
  U.Login
 ,t.Date::date AS Day
 ,COUNT(UserItem.UserId) AS Count
 ,AVG(COUNT(UserItem.UserId)) OVER (PARTITION BY U.Login ORDER By t.Date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS RollingAvg
FROM
  "User" U
CROSS JOIN
  generate_series(date '2020-01-01' - 3, date '2020-01-01' + 3, interval '1' DAY) AS t(date)
LEFT JOIN
  UserItem UserItem
    ON UserItem.UserId = U.UserId
        AND UserItem.Updated_At = t.Date
GROUP BY
  U.Login
 ,t.Date
ORDER BY
  U.Login
 ,t.Date

You can restrict to only those users with activity in the date range by adding:

WHERE
  U.UserId IN
    (
      SELECT
        UserId
      FROM
        UserItem
      WHERE
        Updated_At >= date '2020-01-01' - 3 
        AND Updated_At <= date '2020-01-01' + 3
    )

Fiddle here.