PostgreSQL – How to improve the “algorithm” for figuring out the average spending on food per month

postgresql query

I have a database table keeping track of every grocery delivery. It's stored as timestamps and sums (EUR).

Currently, I do this:

SELECT (SUM("amount in EUR") / 3) 
FROM personal."food purchases" 
WHERE "purchase time" >= now() - INTERVAL '3 months';

Now, let's say that the current day is the 1st of September 2020. This "algo" adds up all the deliveries from exactly three months before the current moment, which I guess means from the first day of June, and then simply divides that sum by 3 to represent "three months".

Sometimes, depending on the day, this will give a reasonable value. Other times, and I cannot fully describe/understand why, this gives either a too high or too low value. I suspect that the reason for this is that my food purchases/deliveries don't always happen on the exact same day, and vary quite a bit in price as well. Thus, the resulting sum, even though I do an interval of three months and then slash with three, and thus "should" give the average food cost "recently", is oftentimes misleading.

I can't tell for sure, but I think that it sometimes includes one purchase "too many", and sometimes "one too few". It might even be worse than that.

Can you help me improve my formula to be more "fair"? I want to know the average amount I pay for food each month, and I'm not convinced that I'm getting that value with my current query.

I think I need to drop the part which goes three months back from the current time, and instead somehow calculates the… last three… months… um… Frankly, I don't know what I need to do. I just don't think that the current method is correct.

I'd appreciate it if your suggestion/solution stays as close as possible to my original query. I want to understand what is happening. I actually suspect that this is a very common issue and is probably built into PG and all other database software.

Best Answer

I would first calculate the sum per month, then calculate the average from that:

select avg(amount) as avg_amount_per_month
from (
  SELECT date_trunc('month', "purchase time"), SUM("amount in EUR") as amount
  FROM personal."food purchases" 
  WHERE "purchase time" >= now() - INTERVAL '3 months'
  group by date_trunc('month', "purchase time")
) t
;

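The inner query returns one row per calendar month that has purchases within the chosen interval, and the outer query then calculates the average of those monthly sums. Be aware that an interval starting three months before now() usually touches four calendar months, so the first and last rows cover only part of a month unless the query runs exactly at a month boundary.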
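If the goal is the average over whole calendar months, one option is to align both bounds to month boundaries with date_trunc(). The following is only a sketch under that assumption; it reuses the table and column names from the question and looks at the three most recent complete months, leaving out the current, unfinished one:

```sql
-- Sketch: average monthly spending over the three most recent
-- complete calendar months (the current month is excluded).
SELECT avg(amount) AS avg_amount_per_month
FROM (
  SELECT date_trunc('month', "purchase time") AS purchase_month
       , sum("amount in EUR") AS amount
  FROM   personal."food purchases"
  -- lower bound: first day of the month three months before the current month
  WHERE  "purchase time" >= date_trunc('month', now()) - interval '3 months'
  -- upper bound: first day of the current month (exclusive)
  AND    "purchase time" <  date_trunc('month', now())
  GROUP  BY 1
) t;
```

Run on 1 September 2020, this would cover 1 June through 31 August. Note that a month without any purchases produces no row and is therefore not counted by avg(); dividing the overall sum by 3 instead would count such a month as zero.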

Related solution: PostgreSQL – How to Calculate Rolling Sum, Count, and Average Over Date Interval

The query you have

You could simplify your query using a WINDOW clause, but that's just shortening the syntax, not changing the query plan.

SELECT id, trans_ref_no, amount, trans_date, entity_id
     , SUM(amount) OVER w AS trans_total
     , COUNT(*)    OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date)
             ORDER BY trans_date
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

This also uses the slightly faster count(*), since id is certainly defined NOT NULL?
And you don't need to ORDER BY entity_id, since you already PARTITION BY entity_id.

You can simplify further, though:
Don't add ORDER BY to the window definition at all, it's not relevant to your query. Then you don't need to define a custom window frame, either:

SELECT id, trans_ref_no, amount, trans_date, entity_id
     , SUM(amount) OVER w AS trans_total
     , COUNT(*)    OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date));

Simpler, faster, but still just a better version of what you have, with static months.

The query you might want

... is not clearly defined, so I'll build on these assumptions:

Count transactions and sum amounts for every 30-day period between the first and last transaction of any entity_id. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.

SELECT entity_id, trans_date
     , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
     , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
FROM  (
   SELECT entity_id
        , generate_series (min(trans_date)::timestamp
                         , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                         , interval '1 day')::date AS trans_date
   FROM   transactiondb 
   GROUP  BY 1
   ) x
LEFT JOIN (
   SELECT entity_id, trans_date
        , sum(amount) AS daily_amount, count(*) AS daily_count
   FROM   transactiondb
   GROUP  BY 1, 2
   ) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
             ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);

This lists all 30-day periods for each entity_id with your aggregates and with trans_date being the first day (incl.) of the period. To get values for each individual row join to the base table once more ...

The basic difficulty is the same as discussed here:

Referencing current row in FILTER clause of window function

The frame definition of a window cannot depend on values of the current row.

And rather call generate_series() with timestamp input:

Generating time series between two dates in PostgreSQL

The query you actually want

After question update and discussion:
Accumulate rows of the same entity_id in a 30-day window starting at each actual transaction.

Since your data is distributed sparsely, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins, yet:

SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30  -- exclude upper bound
-- AND    t0.entity_id = 114284  -- or pick a single entity ...
GROUP  BY t0.id  -- is PK!
ORDER  BY t0.trans_date, t0.id;

A rolling window could only make sense (with respect to performance) with data for most days.

This does not aggregate duplicates on (trans_date, entity_id) per day, but all rows of the same day are always included in the 30-day window.
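On Postgres 9.3 or later, where LATERAL joins are available, the same 30-day accumulation could be written with a LATERAL subquery instead of the self-join. This is a sketch, not a tested replacement; it assumes trans_date is of type date (as the "+ 30" in the self-join above implies) and reuses the table and column names from that query:

```sql
-- Sketch for Postgres 9.3+: for each transaction, sum and count all
-- transactions of the same entity_id in the 30 days starting at its date.
SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
     , t1.trans_total, t1.trans_count
FROM   transactiondb t0
CROSS  JOIN LATERAL (
   SELECT sum(t.amount) AS trans_total, count(*) AS trans_count
   FROM   transactiondb t
   WHERE  t.entity_id   = t0.entity_id
   AND    t.trans_date >= t0.trans_date
   AND    t.trans_date <  t0.trans_date + 30  -- exclude upper bound
   ) t1
ORDER  BY t0.trans_date, t0.id;
```

Because the subquery always includes the row from t0 itself, every transaction gets exactly one result row, matching the output of the self-join. Whether it is faster depends on the data distribution and available indexes, so compare both with EXPLAIN ANALYZE.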

For a big table, a covering index like this could help quite a bit:

CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);

The last column amount is only useful if you get index-only scans out of it. Else drop it.

But it's not going to be used while you select the whole table anyway. It would support queries for a small subset.

Related Solutions

Mongodb – How to store and query products the price of which depends on a complex set of conditions

PostgreSQL – How to Calculate Rolling Sum, Count, and Average Over Date Interval