Postgresql – Time-series window query for calculating rolling average across group by subquery is inaccurate due to missing rows

aggregate, postgresql, window functions

I'm trying to generate a rolling average based on a generated time series joined with a subquery on each day of the time series. However, the subquery occasionally has no result for a given day, so it returns no row for that day. This breaks the rolling average, which is based on preceding/following rows. It still calculates, but the result is wrong because days with a count of 0 are not included: those rows simply aren't present.

SELECT
    d::date AS day,
    updates.login as login,
    avg(coalesce(updates.counts, 0)) over (order by date(d) rows between 3 preceding and 3 following) as rollingavg
from generate_series(date '${from}' - 3, date '${to}' + 3, interval '1' DAY) AS t(d)
LEFT JOIN (
    SELECT
        date(item.updated_at) AS day,
        u.login,
        count(date(item.updated_at)) AS counts
    FROM
        user_item as item
        inner join user u ON u.id = item.user_id
    WHERE
        item.updated_at >= (date '${from}' - 3)
        AND item.updated_at <= (date '${to}' + 3)
    GROUP by day, login 
) updates ON updates.day = t.d

If I remove the login from the group by clause and modify the query to fit, it appears to work, but that's only because there is enough data that no day goes without at least one update. In the off chance there were a day without an update, this would experience the same problem.

My initial thought was to join a generated table that is essentially a cross join between the time series and every possible author, or possibly modify my from generate_series to represent such a table that includes a baseline '0' count for every user on every day, but something about that feels wrong.

Am I on the right path?

EDIT: Some simplified sample data to make it a bit more obvious what I'm getting at. This is focused purely on the subquery since that is where the problem begins:

User:

id	login
1	userA
2	userB

UserItem:

updated_at	user_id
'2020-01-01'	1
'2020-01-01'	2
'2020-01-02'	1
'2020-01-03'	2

The desired result is as follows:

date	user	count
'2020-01-01'	userA	1
'2020-01-01'	userB	1
'2020-01-02'	userA	1
'2020-01-02'	userB	0
'2020-01-03'	userA	0
'2020-01-03'	userB	1

But what I'm getting:

date	user	count
'2020-01-01'	userA	1
'2020-01-01'	userB	1
'2020-01-02'	userA	1
'2020-01-03'	userB	1

Best Answer

As Lennart mentioned, you'll need to CROSS JOIN your generated series with your users to create the required combinations of (Login,Date):

SELECT
  U.Login
 ,t.Date::date AS Day
 ,COUNT(UserItem.UserId) AS Count
 ,AVG(COUNT(UserItem.UserId)) OVER (PARTITION BY U.Login ORDER By t.Date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS RollingAvg
FROM
  "User" U
CROSS JOIN
  generate_series(date '2020-01-01' - 3, date '2020-01-01' + 3, interval '1' DAY) AS t(date)
LEFT JOIN
  UserItem UserItem
    ON UserItem.UserId = U.UserId
        AND UserItem.Updated_At = t.Date
GROUP BY
  U.Login
 ,t.Date
ORDER BY
  U.Login
 ,t.Date

You can restrict to only those users with activity in the date range by adding:

WHERE
  U.UserId IN
    (
      SELECT
        UserId
      FROM
        UserItem
      WHERE
        Updated_At >= date '2020-01-01' - 3 
        AND Updated_At <= date '2020-01-01' + 3
    )
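
To see why the cross join fixes the missing rows, here is a minimal Python sketch of the same logic, using the sample data from the question. It is not a replacement for the SQL; the function names (`daily_counts`, `rolling_avg`) are made up for illustration, and the dictionaries stand in for the "User" and UserItem tables.

```python
from collections import Counter
from datetime import date, timedelta

# Sample data from the question: users and their item updates.
users = {1: "userA", 2: "userB"}
user_items = [
    (date(2020, 1, 1), 1),
    (date(2020, 1, 1), 2),
    (date(2020, 1, 2), 1),
    (date(2020, 1, 3), 2),
]

def daily_counts(start, end):
    """Return {(login, day): count} for every login/day pair, zeros included."""
    seen = Counter((users[uid], day) for day, uid in user_items)
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    # The "cross join": every login is paired with every day in the series.
    return {(login, day): seen.get((login, day), 0)
            for login in users.values() for day in days}

def rolling_avg(counts, login, day, radius=3):
    """Average the rows from `radius` days before to `radius` days after.

    Days outside the generated series are absent, so the frame shrinks at
    the edges instead of being padded with zeros (as with a ROWS frame).
    """
    window = [counts[(login, day + timedelta(days=k))]
              for k in range(-radius, radius + 1)
              if (login, day + timedelta(days=k)) in counts]
    return sum(window) / len(window)

counts = daily_counts(date(2020, 1, 1), date(2020, 1, 3))
print(counts[("userB", date(2020, 1, 2))])  # the zero row that was missing
print(rolling_avg(counts, "userA", date(2020, 1, 2)))
```

Because every (login, day) pair exists before counting, days without updates contribute a 0 to the average instead of disappearing from the frame.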


Related Solutions

Postgresql – Rolling sum / count / average over date interval

The query you have

You could simplify your query using a WINDOW clause, but that's just shortening the syntax, not changing the query plan.

SELECT id, trans_ref_no, amount, trans_date, entity_id
     , SUM(amount) OVER w AS trans_total
     , COUNT(*)    OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date)
             ORDER BY trans_date
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

This also uses the slightly faster count(*), assuming id is defined NOT NULL.
And you don't need to ORDER BY entity_id, since you already PARTITION BY entity_id.

You can simplify further, though:
Don't add ORDER BY to the window definition at all, it's not relevant to your query. Then you don't need to define a custom window frame, either:

SELECT id, trans_ref_no, amount, trans_date, entity_id
     , SUM(amount) OVER w AS trans_total
     , COUNT(*)    OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date));

Simpler, faster, but still just a better version of what you have, with static months.
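
As a rough illustration of what a window without ORDER BY computes, the Python sketch below attaches the totals of the whole partition to every row. The sample rows and the helper names (`month_key`, `with_partition_totals`) are assumptions made up for this example, not taken from the original table.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (id, entity_id, trans_date, amount).
rows = [
    (1, 10, date(2014, 1, 5), 100),
    (2, 10, date(2014, 1, 20), 50),
    (3, 10, date(2014, 2, 1), 30),
    (4, 20, date(2014, 1, 7), 70),
]

def month_key(entity_id, trans_date):
    # Plays the role of PARTITION BY entity_id, date_trunc('month', trans_date).
    return (entity_id, trans_date.year, trans_date.month)

def with_partition_totals(rows):
    totals = defaultdict(lambda: [0, 0])  # partition -> [sum(amount), count(*)]
    for _id, entity_id, trans_date, amount in rows:
        cell = totals[month_key(entity_id, trans_date)]
        cell[0] += amount
        cell[1] += 1
    # Every row receives the totals of its whole partition, whatever the row order.
    return [(row_id, amount, *totals[month_key(entity_id, trans_date)])
            for row_id, entity_id, trans_date, amount in rows]

result = with_partition_totals(rows)
print(result[0])
```

Row order never enters the computation, which is why the ORDER BY and the explicit frame can be dropped from the window definition.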

The query you might want

... is not clearly defined, so I'll build on these assumptions:

Count transactions and amount for every 30-day period within the first and last transaction of any entity_id. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.

SELECT entity_id, trans_date
     , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
     , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
FROM  (
   SELECT entity_id
        , generate_series (min(trans_date)::timestamp
                         , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                         , interval '1 day')::date AS trans_date
   FROM   transactiondb 
   GROUP  BY 1
   ) x
LEFT JOIN (
   SELECT entity_id, trans_date
        , sum(amount) AS daily_amount, count(*) AS daily_count
   FROM   transactiondb
   GROUP  BY 1, 2
   ) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
             ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);

This lists all 30-day periods for each entity_id with your aggregates and with trans_date being the first day (incl.) of the period. To get values for each individual row join to the base table once more ...
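
The stated goal (one result per possible 30-day period, keyed by its first day) can be sketched in Python as below. This follows the description by calendar day rather than translating the SQL frame row by row, and the sample rows and the function name `thirty_day_periods` are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical rows: (entity_id, trans_date, amount).
rows = [
    (1, date(2020, 1, 1), 10),
    (1, date(2020, 1, 1), 5),
    (1, date(2020, 1, 31), 7),
    (1, date(2020, 2, 5), 3),
]

def thirty_day_periods(rows, length=30):
    daily = defaultdict(lambda: [0, 0])  # (entity, day) -> [amount, count]
    bounds = {}                          # entity -> (first day, last day)
    for entity_id, day, amount in rows:
        daily[(entity_id, day)][0] += amount
        daily[(entity_id, day)][1] += 1
        lo, hi = bounds.get(entity_id, (day, day))
        bounds[entity_id] = (min(lo, day), max(hi, day))

    result = []
    for entity_id, (first, last) in sorted(bounds.items()):
        # Last period start: max(first, last - 29 days), as in the GREATEST() call.
        last_start = max(first, last - timedelta(days=length - 1))
        start = first
        while start <= last_start:
            total = count = 0
            for k in range(length):
                cell = daily.get((entity_id, start + timedelta(days=k)))
                if cell:  # days without transactions add nothing
                    total += cell[0]
                    count += cell[1]
            result.append((entity_id, start, total, count))
            start += timedelta(days=1)
    return result

result = thirty_day_periods(rows)
print(len(result), result[0])
```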

The basic difficulty is the same as discussed here:

Referencing current row in FILTER clause of window function

The frame definition of a window cannot depend on values of the current row.

And rather call generate_series() with timestamp input:

Generating time series between two dates in PostgreSQL

The query you actually want

After question update and discussion:
Accumulate rows of the same entity_id in a 30-day window starting at each actual transaction.

Since your data is distributed sparsely, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins, yet:

SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30  -- exclude upper bound
-- AND    t0.entity_id = 114284  -- or pick a single entity ...
GROUP  BY t0.id  -- is PK!
ORDER  BY t0.trans_date, t0.id


A rolling window could only make sense (with respect to performance) with data for most days.

This does not aggregate duplicates on (trans_date, entity_id) per day, but all rows of the same day are always included in the 30-day window.
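
For readers who prefer to trace the logic outside SQL, here is a small Python sketch of the self-join with a range condition: each row collects all rows of the same entity within the 30 days starting at its own date. The sample rows and the function name `window_from_each_row` are assumptions for illustration only.

```python
from datetime import date, timedelta

# Hypothetical rows: (id, entity_id, trans_date, amount); id is unique.
rows = [
    (1, 10, date(2014, 1, 1), 100),
    (2, 10, date(2014, 1, 15), 50),
    (3, 10, date(2014, 2, 10), 30),
    (4, 20, date(2014, 1, 1), 70),
]

def window_from_each_row(rows, days=30):
    out = []
    for t0 in rows:
        upper = t0[2] + timedelta(days=days)  # exclusive upper bound
        # Self-join: same entity, date within [t0.date, t0.date + days).
        matches = [t1 for t1 in rows
                   if t1[1] == t0[1] and t0[2] <= t1[2] < upper]
        out.append((t0[0], sum(t1[3] for t1 in matches), len(matches)))
    return out

result = window_from_each_row(rows)
print(result)
```

Each row always matches itself, so the count is at least 1, just as in the SQL version with an inner join.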

For a big table, a covering index like this could help quite a bit:

CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);

The last column amount is only useful if you get index-only scans out of it. Else drop it.

But it's not going to be used while you select the whole table anyway. It would support queries for a small subset.

PostgreSQL: Generate a series of dates for each group in a table

1. `CROSS JOIN`, `LEFT JOIN LATERAL` to subquery

SELECT a.user_id, COALESCE(b.balance, 0) AS balance, d.as_of_date
FROM   (
   SELECT d::date AS as_of_date  -- cast to date right away
   FROM   generate_series(timestamp '2016-01-01', '2016-01-03', interval '1 day') d
   ) d
JOIN   accounts a ON a.create_date <= d.as_of_date
LEFT   JOIN LATERAL (
   SELECT balance
   FROM   balances
   WHERE  user_id = a.user_id
   AND    as_of_date <= d.as_of_date
   ORDER  BY as_of_date DESC
   LIMIT  1
   ) b ON true
ORDER  BY a.user_id, d.as_of_date;

Returns your desired result - except that as_of_date is an actual date, not a timestamp like in your example. That should be more appropriate.

Users that are created already, but don't have any transactions, yet, are listed with a balance of 0. You did not define how to deal with the corner case.
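
The lookup performed by the LATERAL subquery (latest balance on or before a given date, defaulting to 0) can be modelled in Python as follows. The data and the names `balance_as_of` and `report` are hypothetical; the dictionaries stand in for the accounts and balances tables.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Hypothetical data: user_id -> create_date, and per-user balance history
# as (as_of_date, balance) pairs sorted by date.
accounts = {1: date(2016, 1, 1), 2: date(2016, 1, 2)}
balances = {
    1: [(date(2016, 1, 1), 100), (date(2016, 1, 3), 150)],
    2: [],
}

def balance_as_of(user_id, day):
    """Latest balance on or before `day`, or 0 if there is none yet."""
    history = balances.get(user_id, [])
    dates = [d for d, _ in history]
    i = bisect_right(dates, day)  # number of entries dated <= day
    return history[i - 1][1] if i else 0

def report(start, end):
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    # Only list a user from the account's create_date onwards.
    return [(uid, balance_as_of(uid, d), d)
            for uid, created in sorted(accounts.items())
            for d in days if created <= d]

result = report(date(2016, 1, 1), date(2016, 1, 3))
print(result)
```

The binary search on a sorted history is loosely analogous to what the multicolumn index on (user_id, as_of_date DESC) lets the database do for each (user, date) pair.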

Rather use timestamp input for generate_series():

Generating time series between two dates in PostgreSQL

It's crucial for performance that you back this up with a multicolumn index:

CREATE INDEX balances_multi_idx ON balances (user_id, as_of_date DESC, balance);

We have had a very similar case on SO just this week:

Aggregating the most recent joined records per week

Find more explanation there.

2. `CROSS JOIN`, `LEFT JOIN`, window functions

SELECT user_id
     , COALESCE(max(balance) OVER (PARTITION BY user_id, grp
                                   ORDER BY as_of_date), 0) AS balance
     , as_of_date
FROM  (
   SELECT a.user_id, b.balance, d.as_of_date
        , count(b.user_id) OVER (PARTITION BY user_id ORDER BY as_of_date) AS grp
   FROM   (
      SELECT d::date AS as_of_date  -- cast to date right away
      FROM   generate_series(timestamp '2016-01-01', '2016-01-03', interval '1 day') d
      ) d
   JOIN   accounts a ON a.create_date <= d.as_of_date
   LEFT   JOIN balances b USING (user_id, as_of_date)
   ) sub
ORDER  BY user_id, as_of_date;

Same result. If you have the multicolumn index mentioned above and can get index-only scans out of it, the first solution is most probably faster.

The main feature is the running count of values to form groups. Since count() does not count NULL values, all dates without balance fall into the same group (grp) as the most recent balance. Then use a simple max() over the same window frame extended by grp to copy the last balance for dangling gaps.
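
The grouping trick can be traced with a short Python sketch for a single user. The list below is made-up sample data, with None standing in for NULL, and `fill_forward` is a hypothetical helper name.

```python
# One user's balances ordered by date; None marks a day without a balance row.
balances_by_day = [None, 100, None, None, 150, None]

def fill_forward(values, default=0):
    filled = []
    grp = 0           # running count of non-NULL values seen so far
    group_value = {}  # grp -> the single non-NULL value in that group
    for v in values:
        if v is not None:
            grp += 1
            group_value[grp] = v
        # max() over a group with one non-NULL value returns that value;
        # group 0 (before any balance) has none, so the default applies.
        filled.append(group_value.get(grp, default))
    return filled

filled = fill_forward(balances_by_day)
print(filled)
```

Each group holds exactly one non-NULL balance followed by its gap days, which is why max() (or min()) is enough to copy it forward.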
