I need to compute the trailing 12 months sum of sales for a given client_id, based on each row given month.
Here is the initial table in a month by month aggregated sales by client (filtered here on a specific client 511656A75
):
CREATE TEMP TABLE foo AS
SELECT idclient, month_transac, sales
FROM ( VALUES
( '511656A75', '2010-06-01', 68.57 ),
( '511656A75', '2010-07-01', 88.63 ),
( '511656A75', '2010-08-01', 94.91 ),
( '511656A75', '2010-09-01', 70.66 ),
( '511656A75', '2010-10-01', 28.84 ),
( '511656A75', '2015-10-01', 85.00 ),
( '511656A75', '2015-12-01', 114.42 ),
( '511656A75', '2016-01-01', 137.08 ),
( '511656A75', '2016-03-01', 172.92 ),
( '511656A75', '2016-04-01', 125.00 ),
( '511656A75', '2016-05-01', 127.08 ),
( '511656A75', '2016-06-01', 104.17 ),
( '511656A75', '2016-07-01', 98.22 ),
( '511656A75', '2016-08-01', 37.08 ),
( '511656A75', '2016-10-01', 108.33 ),
( '511656A75', '2016-11-01', 104.17 ),
( '511656A75', '2017-01-01', 201.67 )
) AS t(idclient, month_transac, sales);
Please note that some months do not have any sales (no row), so I guess I cannot use a WINDOW
function (with the preceeding 12 rows for example).
Using this great answer for a similar problem (Rolling sum / count / average over date interval) I have done this query:
SELECT t1.idclient
, t1.month_transac
, t1.sales
, SUM(t2.sales) as sales_ttm
FROM temp_sales_sample_month_aggr t1
LEFT JOIN temp_sales_sample_month_aggr t2 USING (idclient)
WHERE
t1.idclient = '511656A75' -- for example only
AND t2.month_transac >= (t1.month_transac - interval '12 months')
AND t2.month_transac < t1.month_transac
GROUP BY 1, 2, 3
ORDER BY 2
;
Results are OK: sales_ttm
is the trailing 12 months sum of sales, without the sales of the row month (i.e. last row Jan 2017 sums all 2016 sales).
idclient | month_transac | sales | sales_ttm
-----------+---------------+--------+---------
511656A75 | 2010-07-01 | 88.63 | 68.57
511656A75 | 2010-08-01 | 94.91 | 157.20
[...]
511656A75 | 2015-12-01 | 114.42 | 824.83
511656A75 | 2016-01-01 | 137.08 | 892.17
511656A75 | 2016-03-01 | 172.92 | 752.75
511656A75 | 2016-04-01 | 125.00 | 925.67
511656A75 | 2016-05-01 | 127.08 | 1028.17
511656A75 | 2016-06-01 | 104.17 | 1155.25
511656A75 | 2016-07-01 | 98.22 | 1073.59
511656A75 | 2016-08-01 | 37.08 | 1171.81
511656A75 | 2016-10-01 | 108.33 | 1000.97
511656A75 | 2016-11-01 | 104.17 | 1024.30
511656A75 | 2017-01-01 | 201.67 | 1014.05
Problem is that the first month (here June 2010 – see 1st row values in initial table) is not in the resulting set, as there are no past sales for it, and therefore the LEFT JOIN has no row for it.
Expected/Wanted:
idclient | month_transac | sales | sales_ttm
-----------+---------------+--------+---------
511656A75 | 2010-06-01 | 68.57 | 0.00
511656A75 | 2010-07-01 | 88.63 | 68.57
511656A75 | 2010-08-01 | 94.91 | 157.20
511656A75 | 2010-09-01 | 70.66 | 252.11
[...]
I could add the row's sales (with a t2.month_transac <= t1.month_transac
and then substract it), but I suppose I can find a more elegant way.
I also tried to use LATERAL
join (as Erwin suggested in his anwser ("it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins, yet"), but I guess I haven't grasped the way it works, as I only manage to get ERRORS.
- Do you confirm
WINDOW
function should be excluded? - Is there a way using a 'simple' LEFY JOIN to get all rows from
t1
? - Could
LATERAL
be useful is this case, and how? - What could be some optimization approaches?
Using PostgreSQL 9.6.2, Windows 10 or Ubuntu 16.04
Performance Evaluation
So we have 3 possible solutions so far; let's see which one performs better
I checked if the resulting tables were identical (they are).
Test done on a 270k rows table, knowing that it is a resulting table from a sample of 1% of all clients
Initial approach – LEFT JOIN
and GROUP BY
It is the corrected version of the suggested query in the question, i.e. include the current month in the sum, and substract the month's value from the sum, in order to have all rows.
SELECT t1.idclient
, t1.month_transac
, t1.sales
, SUM(t2.sales) - t1.sales as sales_ttm
FROM temp_sales_sample_month_aggr t1
LEFT JOIN temp_sales_sample_month_aggr t2 USING (idclient)
WHERE
t2.month_transac >= (t1.month_transac - interval '12 months') AND
t2.month_transac <= t1.month_transac
GROUP BY 1, 2, 3
ORDER BY 2
;
Query performance:
Planning time: 3.615 ms
Execution time: 1315.636 ms
@joanolo approach – Subquery
SELECT
t1.idclient
, t1.month_transac
, t1.sales
, (SELECT
coalesce(SUM(t2.sales), 0)
FROM
temp_sales_sample_month_aggr t2
WHERE
t2.idclient = t1.idclient
AND t2.month_transac >= (t1.month_transac - interval '12 months')
AND t2.month_transac < t1.month_transac
) AS sales_ttm
FROM
temp_sales_sample_month_aggr t1
GROUP BY
t1.idclient, t1.month_transac, t1.sales
ORDER BY
t1.month_transac ;
Query performance:
Planning time: 0.350 ms
Execution time: 3163.354 ms
I guess it has more rows to process with the subquery
LEFT JOIN LATERAL
approach
I finally managed to get it work.
SELECT t1.idclient
, t1.month_transac
, t1.sales
, COALESCE(lat.sales_ttm, 0.0)
FROM temp_sales_sample_month_aggr t1
LEFT JOIN LATERAL (
SELECT SUM(t2.sales) as sales_ttm
FROM temp_sales_sample_month_aggr t2
WHERE
t1.idclient = t2.idclient AND
t2.month_transac >= (t1.month_transac - interval '12 months') AND
t2.month_transac < t1.month_transac
) lat ON TRUE
ORDER BY 2
;
Query performance:
Planning time: 0.468 ms
Execution time: 2773.754 ms
So I guess LATERAL is not helping here, compared to the simpler LEFT JOIN
Best Answer
Something like this should work..
Here we
Calculate the date-range for the idclient in a CTE.
RIGHT OUTER
that CTE to a our sample dataset. We do this so we grow our sample dataset and we have entries with monthly_sales = 0 where needed.Use a window function that uses windows over
ROWS 12 PRECEDING
. That's the key. That's the past 12 months. The window function can't operate on rows that are null, so we set them to 0 before we get to this step.Select just the rows where
monthly_sales > 0
. We have to do this after the window function so as not to much with what is available for calculation (the window).Output,