Let's say, for simplicity's sake, I have the following table:
id  amount  p_id  date
------------------------------------------------
 1       5     1  2020-01-01T01:00:00
 2      10     1  2020-01-01T01:10:00
 3      15     2  2020-01-01T01:20:00
 4      10     3  2020-01-01T03:30:00
 5      10     4  2020-01-01T03:50:00
 6      20     1  2020-01-01T03:40:00
Here's a sample response I want:
{
  "2020-01-01T01:00:00": 25, -- from adding the records with ids 2 and 3
  "2020-01-01T03:00:00": 55  -- from adding the records with ids 3, 4, 5 and 6
}
I want to get the total (`sum(amount)`) of all unique `p_id`s, grouped by the hour.
The row chosen per `p_id` is the one with the latest `date`. So for example, the first value in the response above doesn't include id 1, because the record with id 2 has the same `p_id` and the `date` on that row is later.
The one tricky thing is that I want to include the `amount` of each `p_id` even if its `date` is before the hour presented. So for example, in the second value of the response (with key "2020-01-01T03:00:00"), even though id 3 has a timestamp in a different hour, it's the latest row for `p_id` 2 and therefore gets included in the sum for "2020-01-01T03:00:00". But the row with id 6 overrides the row with id 2, which has the same `p_id` 1.
In other words: always take the latest `amount` for each `p_id` so far, and compute the sum for every distinct hour found in the table.
Best Answer
Simple case
Getting the sum for a single, given hour is relatively simple:
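A sketch of what such a query could look like (Postgres and the table name `tbl` are my assumptions, as is the hard-coded hour):

```sql
-- Latest amount per p_id up to the end of one given hour, summed.
SELECT sum(amount) AS total
FROM  (
   SELECT DISTINCT ON (p_id) amount
   FROM   tbl
   WHERE  date < timestamp '2020-01-01 03:00:00' + interval '1 hour'
   ORDER  BY p_id, date DESC
   ) sub;
```

`DISTINCT ON (p_id)` keeps exactly one row per `p_id`: the first one in `ORDER BY p_id, date DESC` order, i.e. the latest. With the sample data this yields 55 for the hour 03:00 (rows 3, 4, 5 and 6).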
`DISTINCT ON` is typically substantially faster than a subquery / CTE with `row_number()`. See:

For many rows per `p_id`, there is potential to optimize some more with smart indexing and a corresponding query. See:

Special running sum for every hour
You seem to want the sum for every distinct hour in the table. You can simply distill the distinct hours and apply the simple solution to every one of them in a `LATERAL` subquery. Should be fine while there are not many distinct hours. But the query does not scale very well: the latest
`amount` for every `p_id` is computed at full (steadily increasing) cost for every additional hour.

Optimize running sum for many hours
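As the baseline to optimize, the `LATERAL` query described above could look like this (a sketch; Postgres and the table name `tbl` are my assumptions):

```sql
-- Distill the distinct hours, then apply the simple solution per hour.
SELECT h.hour, l.total
FROM  (
   SELECT DISTINCT date_trunc('hour', date) AS hour
   FROM   tbl
   ) h
CROSS  JOIN LATERAL (
   SELECT sum(amount) AS total
   FROM  (
      SELECT DISTINCT ON (p_id) amount
      FROM   tbl
      WHERE  date < h.hour + interval '1 hour'  -- all rows up to end of hour
      ORDER  BY p_id, date DESC
      ) sub
   ) l
ORDER  BY h.hour;
```

With the sample data this returns 25 for the hour 01:00 and 55 for the hour 03:00.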
I expect this recursive CTE to scale better (while having considerable overhead initially), as it only needs to join to one row per `p_id` for every next hour:

Plus, there may be more potential to optimize, depending on value distribution, like indicated for the simple case at the top.
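One possible shape for such a recursive CTE (my sketch, assuming Postgres and a table named `tbl`; the exact query may differ):

```sql
-- Carry the latest amount per p_id from hour to hour, so each step
-- only merges in the (at most one) latest row per p_id of that hour.
WITH RECURSIVE h AS (              -- distinct hours, numbered
   SELECT date_trunc('hour', date) AS hour
        , row_number() OVER (ORDER BY date_trunc('hour', date))::int AS rn
   FROM   tbl
   GROUP  BY 1
   )
, state AS (
   SELECT 1 AS rn, p.p_id, l.amount          -- seed: state after first hour
   FROM  (SELECT DISTINCT p_id FROM tbl) p   -- NULL amount if not seen yet
   LEFT   JOIN LATERAL (
      SELECT t.amount
      FROM   tbl t
      WHERE  t.p_id = p.p_id
      AND    t.date < (SELECT min(hour) FROM h) + interval '1 hour'
      ORDER  BY t.date DESC
      LIMIT  1
      ) l ON true

   UNION ALL                                 -- step: overwrite with next hour
   SELECT s.rn + 1, s.p_id, COALESCE(n.amount, s.amount)
   FROM   state s
   JOIN   h ON h.rn = s.rn + 1
   LEFT   JOIN LATERAL (
      SELECT t.amount
      FROM   tbl t
      WHERE  t.p_id = s.p_id
      AND    t.date >= h.hour
      AND    t.date <  h.hour + interval '1 hour'
      ORDER  BY t.date DESC
      LIMIT  1
      ) n ON true
   )
SELECT h.hour, sum(s.amount) AS total
FROM   state s
JOIN   h USING (rn)
GROUP  BY h.hour
ORDER  BY h.hour;
```

Seeding the state with every `p_id` (amount `NULL` until first seen, which `sum()` ignores) keeps the recursive step to a single pass that only looks at rows of the next hour.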
db<>fiddle here
Aside: calling your `timestamp` column "date" is rather misleading, as that is a different basic data type.