PostgreSQL – Sum all unique values from previous dates

distinct, greatest-n-per-group, postgresql, running-totals, sum

Let's say, for simplicity's sake, I have the following table:

id      amount     p_id     date
------------------------------------------------
1       5          1        2020-01-01T01:00:00
2       10         1        2020-01-01T01:10:00 
3       15         2        2020-01-01T01:20:00 
4       10         3        2020-01-01T03:30:00 
5       10         4        2020-01-01T03:50:00 
6       20         1        2020-01-01T03:40:00

Here's a sample response I want:

{
 "2020-01-01T01:00:00": 25, -- this is from adding records with ids: 2 and 3
 "2020-01-01T03:00:00": 55  -- this is from adding records with ids: 3,4,5 and 6
}

I want to get the total (sum(amount)) across all unique p_ids, grouped by the hour.

The row chosen per p_id is the one with the latest date. So for example, the first value in the response above doesn't include id 1 because the record with id 2 has the same p_id and the date on that row is later.

The one tricky thing is that the sum for each hour should also include the latest amount of every p_id whose date falls before that hour. So for example, in the second value of the response (with key "2020-01-01T03:00:00"), even though id 3 has a timestamp in a different hour, it's the latest row for p_id 2 and therefore gets included in the sum for "2020-01-01T03:00:00". But the row with id 6 overrides id 2, which has the same p_id 1.

In other words: always take the latest amount for each p_id so far, and compute the sum for every distinct hour found in the table.

Best Answer

Simple case

Getting the sum for a single, given hour is relatively simple:

SELECT timestamp '2020-01-01 03:00:00', sum(amount)
FROM  (
   SELECT DISTINCT ON (p_id) amount
   FROM   tbl
   WHERE  date < timestamp '2020-01-01 03:00:00' + interval '1h'
   ORDER  BY p_id, date DESC
   ) sub;
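To try the query, the sample data from the question can be loaded as sketched below. The table name tbl comes from the query above. The column types are assumptions, since the question does not show the table definition.

```sql
-- Assumed schema: the question does not show the DDL, so the types are guesses.
CREATE TABLE tbl (
   id     int PRIMARY KEY
 , amount int NOT NULL
 , p_id   int NOT NULL
 , date   timestamp NOT NULL
);

-- Sample rows copied from the question.
INSERT INTO tbl (id, amount, p_id, date) VALUES
   (1,  5, 1, '2020-01-01 01:00:00')
 , (2, 10, 1, '2020-01-01 01:10:00')
 , (3, 15, 2, '2020-01-01 01:20:00')
 , (4, 10, 3, '2020-01-01 03:30:00')
 , (5, 10, 4, '2020-01-01 03:50:00')
 , (6, 20, 1, '2020-01-01 03:40:00');
```

With this data, the query above picks ids 3, 4, 5 and 6 and returns 55 for the hour starting at 2020-01-01 03:00:00, matching the second value in the question.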

DISTINCT ON is typically substantially faster than a subquery or CTE with row_number().
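For comparison, the row_number() variant of the same single-hour query could look roughly like this. It is only a sketch for illustration; the column aliases hour, total and rn are made up here, and actual performance should be checked with EXPLAIN ANALYZE on real data.

```sql
-- Same result as the DISTINCT ON query, written with a window function.
SELECT timestamp '2020-01-01 03:00:00' AS hour, sum(amount) AS total
FROM  (
   SELECT amount
        , row_number() OVER (PARTITION BY p_id ORDER BY date DESC) AS rn  -- 1 = latest row per p_id
   FROM   tbl
   WHERE  date < timestamp '2020-01-01 03:00:00' + interval '1h'
   ) sub
WHERE  rn = 1;
```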

For many rows per p_id, there is potential to optimize further with smart indexing and a corresponding query.
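One possible shape of such an optimization, as a sketch under assumptions: a multicolumn index on (p_id, date DESC) combined with a LATERAL lookup that fetches only the latest row per p_id. The index name is hypothetical, and date is assumed to be NOT NULL. Whether this beats DISTINCT ON depends on how many rows there are per p_id.

```sql
-- Hypothetical index name; the column order matches the per-p_id lookup below.
CREATE INDEX tbl_p_id_date_idx ON tbl (p_id, date DESC);

SELECT timestamp '2020-01-01 03:00:00' AS hour, sum(l.amount) AS total
FROM  (SELECT DISTINCT p_id FROM tbl) p   -- assumption: a separate table listing all p_id values would be cheaper
CROSS  JOIN LATERAL (
   SELECT t.amount
   FROM   tbl t
   WHERE  t.p_id = p.p_id
   AND    t.date < timestamp '2020-01-01 03:00:00' + interval '1h'
   ORDER  BY t.date DESC                  -- latest row first
   LIMIT  1                               -- one index lookup per p_id
   ) l;
```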

Special running sum for every hour

You seem to want the sum for every distinct hour in the table.

You can simply distill distinct hours, and apply the simple solution to every one of them in a LATERAL subquery:

WITH grid AS (  -- or some other source?
   SELECT DISTINCT date_trunc('hour', date) AS hour
   FROM   tbl
   )
SELECT *
FROM   grid
CROSS  JOIN LATERAL (
   SELECT sum(amount)
   FROM  (
      SELECT DISTINCT ON (p_id) amount
      FROM   tbl
      WHERE  date < grid.hour + interval '1h'
      ORDER  BY p_id, date DESC
      ) sub1
   ) sub2
ORDER  BY grid.hour;

This should be fine as long as there are not many distinct hours, but the query does not scale very well: the latest amount for every p_id is computed at full (steadily increasing) cost for every additional hour.
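As for the "some other source" hinted at in the grid CTE: if hours without any rows should show up as well, the grid could be generated instead of distilled from the table. This is a sketch of that variant, not what the question asks for; the names first_hour, last_hour, b and g are made up here.

```sql
-- Alternative grid: every hour between the first and the last row, gaps included.
WITH grid AS (
   SELECT g.hour
   FROM  (
      SELECT date_trunc('hour', min(date)) AS first_hour
           , date_trunc('hour', max(date)) AS last_hour
      FROM   tbl
      ) b
   CROSS  JOIN LATERAL generate_series(b.first_hour, b.last_hour, interval '1 hour') AS g(hour)
   )
SELECT hour
FROM   grid
ORDER  BY hour;
```

With the sample data from the question, this grid also contains 2020-01-01 02:00:00, for which the running sum would simply repeat the 25 of the previous hour.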

Optimize running sum for many hours

I expect this recursive CTE to scale better (while having considerable overhead initially), as it only needs to join to one row per p_id for every next hour:

WITH RECURSIVE
  values AS (
   SELECT DISTINCT ON (1, 2)
          date_trunc('hour', date) AS hour, p_id, amount
   FROM   tbl
   ORDER  BY 2, 1, date DESC  -- !
   )
, hours AS (
   SELECT hour, row_number() OVER (ORDER BY hour) AS hnr
   FROM  (SELECT DISTINCT hour FROM values) sub
   )
, cte AS (
   SELECT p_id, h.hour, v.amount, 2 AS next_hnr
   FROM  (SELECT DISTINCT p_id FROM values) i  -- all IDs
   JOIN   hours       h ON hnr = 1             -- first hour
   LEFT   JOIN values v USING (p_id, hour)

   UNION ALL
   SELECT c.p_id, h.hour, COALESCE(v.amount, c.amount), c.next_hnr + 1
   FROM   cte         c
   JOIN   hours       h ON h.hnr  = c.next_hnr
   LEFT   JOIN values v ON v.p_id = c.p_id
                       AND v.hour = h.hour
   )
SELECT hour, sum(amount)
FROM   cte
GROUP  BY 1
ORDER  BY 1;
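The question asks for a single JSON object keyed by hour. One way to get there is to wrap the per-hour result in json_object_agg(). The sketch below uses a hypothetical CTE hourly_totals as a stand-in for the result of either query above, filled with the two values expected in the question.

```sql
WITH hourly_totals (hour, total) AS (   -- stand-in for the per-hour result
   VALUES (timestamp '2020-01-01 01:00:00', 25)
        , (timestamp '2020-01-01 03:00:00', 55)
   )
SELECT json_object_agg(
          to_char(hour, 'YYYY-MM-DD"T"HH24:MI:SS')  -- key formatted as in the question
        , total
          ORDER BY hour
       ) AS result
FROM   hourly_totals;
```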

Plus, there may be more potential to optimize, depending on value distribution, as indicated for the simple case at the top.


Aside: calling your timestamp column "date" is rather misleading as that is a different basic data type.