PostgreSQL Interval Calculation – How to Calculate Interval in Days Based on Average Velocity

postgresql, postgresql-9.6

I have an orders table and need to detect when the average order count changes over time. The change could be either an increase or a decrease.

For instance, if there is usually 1 order every two days, I need to detect if that suddenly increased to 3 orders every day or decreased to 1 order every 3 days.

Example table structure and data:

CREATE TABLE "public"."orders" (
    "id" SERIAL,
    "created_at" timestamp NOT NULL,
    PRIMARY KEY ("id")
);

INSERT INTO "orders" ("created_at")
VALUES
(now() - interval '1 hours'),
(now() - interval '6 hours'),
(now() - interval '12 hours'),
(now() - interval '2 days'),
(now() - interval '4 days'),
(now() - interval '6 days'),
(now() - interval '9 days');

In the data set above, suppose I want to detect that over the course of the last week there was usually only 1 order every 2 days, but over the last 24 hours that pattern has been violated with 3 orders in one day.

Desired result:

Previous 7 days' average orders per day (not including last 24 hours): 0.43
Previous 7 days' average order velocity (not including last 24 hours): 1 order every 2 days
Last 24 hours' orders: 3

Alternate scenario:

INSERT INTO "orders" ("created_at")
VALUES
(now() - interval '12 hours'),
(now() - interval '2 days'),
(now() - interval '2 days'),
(now() - interval '2 days'),
(now() - interval '3 days'),
(now() - interval '3 days'),
(now() - interval '3 days'),
(now() - interval '4 days'),
(now() - interval '4 days'),
(now() - interval '4 days'),
(now() - interval '5 days'),
(now() - interval '5 days'),
(now() - interval '5 days'),
(now() - interval '6 days'),
(now() - interval '6 days'),
(now() - interval '6 days'),
(now() - interval '7 days'),
(now() - interval '7 days'),
(now() - interval '7 days');

Alternatively, suppose I have the above data set. Normally there are approximately 3 orders every day for the past week, but over the last 24 hours there was only 1 order.

Desired result:

Previous 7 days' average orders per day (not including last 24 hours): 2.57
Previous 7 days' average order velocity (not including last 24 hours): 3 orders every 1 day
Last 24 hours' orders: 1

How would I do this?

PostgreSQL 9.6

Best Answer

First count the orders per calendar day for the 7 days before the last 24 hours, then calculate the averages from those daily counts:

-- Count orders per calendar date for the 7 days preceding the last 24 hours
WITH cn AS
(
    SELECT created_at::date, COUNT(*) num_orders
    FROM   orders
    WHERE  created_at <  (now() - interval '1 days')
    AND    created_at >= (now() - interval '8 days')
    GROUP BY created_at::date
)
-- AvgOrderVelocity averages over days that had orders;
-- AvgLast7Days divides the total by all 7 days
SELECT to_char(AVG(num_orders), '9.99') AS "AvgOrderVelocity",
       to_char(SUM(num_orders) / 7::decimal(18,2), '0.99')  AS "AvgLast7Days",
       (SELECT COUNT(*)
        FROM   orders
        WHERE  created_at > (now() - interval '1 days')) "Last24h"
FROM   cn;

Using first sample data:

Orders grouped by date:

created_at | num_orders
:--------- | ---------:
2017-05-05 |          1
2017-05-07 |          1
2017-05-09 |          1

AvgOrderVelocity | AvgLast7Days | Last24h
:--------------- | :----------- | ------:
 1.00            |  0.43        |       3
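
To make the arithmetic behind these three columns explicit, here is a minimal Python sketch of the same logic outside the database. It is an illustration only: the function name `order_stats`, the dictionary keys, and the fixed `now` value are assumptions, not part of the original answer. Note that the first figure averages only over days that had at least one order (days without orders never appear in the grouped counts), while the second divides by all 7 days.

```python
from collections import Counter
from datetime import datetime, timedelta

def order_stats(created_at, now):
    """Daily counts for the 7 days before the last 24 hours, plus averages."""
    day_ago = now - timedelta(days=1)
    week_start = now - timedelta(days=8)
    # Equivalent of the CTE: count orders per calendar date inside the window
    per_day = Counter(t.date() for t in created_at if week_start <= t < day_ago)
    total = sum(per_day.values())
    return {
        # Average over days that actually had orders (like AVG(num_orders))
        "avg_per_active_day": total / len(per_day) if per_day else None,
        # Average over the whole 7-day window (like SUM(num_orders) / 7)
        "avg_per_day": total / 7,
        # Orders in the last 24 hours
        "last_24h": sum(1 for t in created_at if t > day_ago),
    }

# Hypothetical fixed "now" so the example is reproducible
now = datetime(2017, 5, 11, 12, 0)
offsets = [timedelta(hours=1), timedelta(hours=6), timedelta(hours=12),
           timedelta(days=2), timedelta(days=4), timedelta(days=6),
           timedelta(days=9)]
stats = order_stats([now - o for o in offsets], now)
print(stats)
```

With the first sample data, the order from 9 days ago falls outside the window, so three orders on three distinct dates remain.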


Using second sample data:

Orders grouped by date:

created_at | num_orders
:--------- | ---------:
2017-05-04 |          3
2017-05-05 |          3
2017-05-06 |          3
2017-05-07 |          3
2017-05-08 |          3
2017-05-09 |          3

AvgOrderVelocity | AvgLast7Days | Last24h
:--------------- | :----------- | ------:
 3.00            |  2.57        |       1


Related Solutions

PostgreSQL – How to Calculate Rolling Sum, Count, and Average Over Date Interval

The query you have

You could simplify your query using a WINDOW clause, but that's just shortening the syntax, not changing the query plan.

SELECT id, trans_ref_no, amount, trans_date, entity_id
     , SUM(amount) OVER w AS trans_total
     , COUNT(*)    OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date)
             ORDER BY trans_date
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

I also switched to the slightly faster count(*), assuming id is defined NOT NULL.
And you don't need to ORDER BY entity_id, since you already PARTITION BY entity_id.

You can simplify further, though:
Don't add ORDER BY to the window definition at all; it's not relevant to your query. Then you don't need to define a custom window frame, either:

SELECT id, trans_ref_no, amount, trans_date, entity_id
     , SUM(amount) OVER w AS trans_total
     , COUNT(*)    OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date));

Simpler, faster, but still just a better version of what you have, with static months.

The query you might want

... is not clearly defined, so I'll build on these assumptions:

Count transactions and sum amounts for every 30-day period between the first and last transaction of each entity_id. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.

SELECT entity_id, trans_date
     , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
     , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
FROM  (
   SELECT entity_id
        , generate_series (min(trans_date)::timestamp
                         , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                         , interval '1 day')::date AS trans_date
   FROM   transactiondb 
   GROUP  BY 1
   ) x
LEFT JOIN (
   SELECT entity_id, trans_date
        , sum(amount) AS daily_amount, count(*) AS daily_count
   FROM   transactiondb
   GROUP  BY 1, 2
   ) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
             ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);

This lists all 30-day periods for each entity_id with your aggregates, with trans_date being the first day (inclusive) of the period. To get values for each individual row, join to the base table once more ...

The basic difficulty is the same as discussed here:

Referencing current row in FILTER clause of window function

The frame definition of a window cannot depend on values of the current row.

Also, prefer calling generate_series() with timestamp input:

Generating time series between two dates in PostgreSQL

The query you actually want

After question update and discussion:
Accumulate rows of the same entity_id in a 30-day window starting at each actual transaction.

Since your data is sparsely distributed, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins yet:

SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30  -- exclude upper bound
-- AND    t0.entity_id = 114284  -- or pick a single entity ...
GROUP  BY t0.id  -- is PK!
ORDER  BY t0.trans_date, t0.id


A rolling window could only make sense (with respect to performance) with data for most days.

This does not aggregate duplicates on (trans_date, entity_id) per day, but all rows of the same day are always included in the 30-day window.
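
As a rough illustration of what the self-join computes, the following Python sketch does the same thing in memory: for each transaction, it sums and counts all rows of the same entity that fall in the 30 days starting at that transaction's date, with the upper bound excluded. The function name `rolling_totals`, the tuple layout, and the sample rows are assumptions made for this example; the nested scan is quadratic and only meant to show the logic, not to replace the SQL.

```python
from datetime import date, timedelta

def rolling_totals(rows, days=30):
    """rows: list of (id, entity_id, trans_date, amount) tuples.

    Returns (id, trans_total, trans_count) per row, where the window
    covers [trans_date, trans_date + days) for the same entity_id.
    """
    rows = list(rows)
    result = []
    for id0, entity0, date0, _ in rows:
        upper = date0 + timedelta(days=days)  # exclusive upper bound
        window = [amount for _, entity, d, amount in rows
                  if entity == entity0 and date0 <= d < upper]
        result.append((id0, sum(window), len(window)))
    return result

# Hypothetical sample data
sample = [
    (1, "A", date(2017, 1, 1), 10),
    (2, "A", date(2017, 1, 15), 5),
    (3, "A", date(2017, 1, 31), 7),   # exactly 30 days after row 1: excluded from its window
    (4, "B", date(2017, 1, 1), 1),
]
totals = rolling_totals(sample)
print(totals)
```

Each row always falls inside its own window, so the count is never zero, which matches the behavior of the inner join.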

For a big table, a covering index like this could help quite a bit:

CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);

The last column amount is only useful if you get index-only scans out of it. Else drop it.

But it's not going to be used while you select the whole table anyway. It would support queries for a small subset.

Postgresql – Efficiently computing aggregate functions over subqueries with incremental data

The standard deviation can be calculated from three quantities: the number of values, the sum of the values, and the sum of the squares of the values. Each of these can be updated incrementally as new values arrive and stored in a work table. The work table will look something like this:

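-- Column types and the key are assumptions; adjust to your data
CREATE TABLE DailyTotals (
  OrderDate           date PRIMARY KEY,
  NumberOfValues      bigint  NOT NULL,
  SumOfValues         numeric NOT NULL,
  SumOfSquareOfValues numeric NOT NULL);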

Since the work table is keyed by date, your desired fourteen-day sliding window can be achieved by reading the rows for the relevant dates. Since each column is a plain sum, adding the per-date values together again is mathematically sound. Yes, there is still a calculation at runtime, but it is much lighter than computing the full standard deviation from the raw values.

When new values arrive, the work table can be updated synchronously (it's a 1-row update), asynchronously, or in batch, depending on the application's appetite for stale data.
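
The idea can be sketched in a few lines of Python, using an in-memory dictionary as a stand-in for the work table. The names `daily_totals`, `record_value`, and `window_stddev` are hypothetical, and the sketch computes the population standard deviation; it is an illustration of the technique under those assumptions, not a drop-in implementation.

```python
import math
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical stand-in for the work table:
# order date -> [number of values, sum of values, sum of squared values]
daily_totals = defaultdict(lambda: [0, 0.0, 0.0])

def record_value(order_date, value):
    """Fold one new value into its day's running totals (a 1-row update)."""
    row = daily_totals[order_date]
    row[0] += 1
    row[1] += value
    row[2] += value * value

def window_stddev(end_date, days=14):
    """Population standard deviation over the `days` dates ending at end_date."""
    n, total, total_sq = 0, 0.0, 0.0
    for offset in range(days):
        row = daily_totals.get(end_date - timedelta(days=offset))
        if row:
            n += row[0]
            total += row[1]
            total_sq += row[2]
    if n == 0:
        return None
    mean = total / n
    # Variance = E[x^2] - (E[x])^2; clamp tiny negative rounding errors to 0
    variance = max(total_sq / n - mean * mean, 0.0)
    return math.sqrt(variance)
```

One caveat worth knowing: the sum-of-squares formula can lose precision when the values are large relative to their spread, so an exact numeric type in the database may be preferable to floating point.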
