PostgreSQL – Efficiently computing aggregate functions over subqueries with incremental data

performance, postgresql, postgresql-9.3, query-performance

I have a PostgreSQL database (version 9.3.6) containing a large number of orders. As the orders are processed, scan_events are triggered and stored, with multiple events per order. Each scan event has a boolean, completed, indicating whether the order can be marked as completed after that event, and multiple completed scans can occur for the same order. Generally speaking, however, the only scans I actually care about are the first scan and the first completed scan.

I want to know the average and standard deviation of the percentage of orders with a given creation date that have received their first scan within x days of being created.

Schema

CREATE TABLE orders (
    id character varying(40) NOT NULL,
    date_created timestamp with time zone NOT NULL
);

ALTER TABLE ONLY orders
    ADD CONSTRAINT orders_pkey PRIMARY KEY (id);

CREATE TABLE scan_events (
    id character varying(100) NOT NULL,
    order_id character varying(40) NOT NULL,
    "time" timestamp with time zone NOT NULL,
    completed boolean NOT NULL
);

CREATE INDEX scan_events_order_id_idx ON scan_events USING btree (order_id);

Desired Computation

For days_elapsed values ranging from 1 to 14 days, I want the means and standard deviations of:

  1. the set of percentages of orders that received any scans within days_elapsed days of their orders.date_created, from each of the past 30 days (aka grouped by DATE(orders.date_created))

  2. the set of percentages of orders that received a scan with completed = TRUE within days_elapsed days of their orders.date_created, from each of the past 30 days (aka grouped by DATE(orders.date_created))

Ideally, the output would look something like this, but honestly, anything performant is fine.

output
----------------
days_elapsed
mean_scanned
stddev_scanned
mean_completed
stddev_completed

Current Progress

I have a query that will get me the counts per day, with the completed-scan counts computed inline:

SELECT DATE(orders.date_created),
    COUNT(DISTINCT orders.id) AS total,
    -- an order counts as "scanned" if any scan arrived within 1 day of creation
    COUNT(DISTINCT CASE WHEN scan_events.time < orders.date_created + INTERVAL '1 day'
                        THEN orders.id END) AS scanned,
    -- likewise, but only scans flagged completed
    COUNT(DISTINCT CASE WHEN scan_events.completed
                         AND scan_events.time < orders.date_created + INTERVAL '1 day'
                        THEN orders.id END) AS completed
FROM orders
LEFT JOIN scan_events ON orders.id = scan_events.order_id
-- half-open range so orders from all of July 31 are included
WHERE orders.date_created >= '2015-07-01'
  AND orders.date_created < '2015-08-01'
GROUP BY DATE(orders.date_created)
ORDER BY DATE(orders.date_created) ASC;

For days_elapsed = 1, this query is roughly what I imagine should work:

SELECT AVG(counts.scanned * 1.0 / counts.total)      AS mean_scanned,
    STDDEV(counts.scanned * 1.0 / counts.total)   AS stddev_scanned,
    AVG(counts.completed * 1.0 / counts.total)    AS mean_completed,
    STDDEV(counts.completed * 1.0 / counts.total) AS stddev_completed
FROM (
    SELECT DATE(orders.date_created),
        COUNT(DISTINCT orders.id) AS total,
        COUNT(DISTINCT CASE WHEN scan_events.time < orders.date_created + INTERVAL '1 day'
                            THEN orders.id END) AS scanned,
        COUNT(DISTINCT CASE WHEN scan_events.completed
                             AND scan_events.time < orders.date_created + INTERVAL '1 day'
                            THEN orders.id END) AS completed
    FROM orders
    LEFT JOIN scan_events ON orders.id = scan_events.order_id
    WHERE orders.date_created >= '2015-07-01'
      AND orders.date_created < '2015-08-01'
    GROUP BY DATE(orders.date_created)
) counts;

The problem with this is that it's most certainly redoing work that it doesn't have to…

Things that I imagine we can take advantage of:

  • AVG and STDDEV ignore NULL values, so we can do some CASE WHEN ... THEN ... END trickery
  • The sets that AVG and STDDEV are calculated over could maybe be built up incrementally as we increase days_elapsed by one (a single-pass sketch follows this list)
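
For instance, I imagine something along these lines could fan all fourteen days_elapsed values out over a single pass of the join, instead of running the query fourteen times. This is only a sketch (generate_series and the CROSS JOIN shape are just one possibility, and I haven't verified the plan is actually any better):

-- Sketch: one pass over orders/scan_events, fanned out across
-- days_elapsed = 1..14 via generate_series. Date bounds mirror the
-- query above and are illustrative.
SELECT counts.days_elapsed,
    AVG(counts.scanned * 1.0 / counts.total)      AS mean_scanned,
    STDDEV(counts.scanned * 1.0 / counts.total)   AS stddev_scanned,
    AVG(counts.completed * 1.0 / counts.total)    AS mean_completed,
    STDDEV(counts.completed * 1.0 / counts.total) AS stddev_completed
FROM (
    SELECT g.days_elapsed,
        DATE(orders.date_created) AS day,
        COUNT(DISTINCT orders.id) AS total,
        COUNT(DISTINCT CASE WHEN scan_events.time
                                 < orders.date_created + g.days_elapsed * INTERVAL '1 day'
                            THEN orders.id END) AS scanned,
        COUNT(DISTINCT CASE WHEN scan_events.completed
                             AND scan_events.time
                                 < orders.date_created + g.days_elapsed * INTERVAL '1 day'
                            THEN orders.id END) AS completed
    FROM generate_series(1, 14) AS g(days_elapsed)
    CROSS JOIN orders
    LEFT JOIN scan_events ON orders.id = scan_events.order_id
    WHERE orders.date_created >= '2015-07-01'
      AND orders.date_created < '2015-08-01'
    GROUP BY g.days_elapsed, DATE(orders.date_created)
) counts
GROUP BY counts.days_elapsed
ORDER BY counts.days_elapsed;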

Any help would be much appreciated — my SQL-fu is not up to snuff 🙁

Best Answer

The standard deviation can be calculated knowing only the number of values, the sum of the values, and the sum of the squares of the values. Each of these can be updated incrementally as new values arrive and stored in a work table. The work table will look something like

CREATE TABLE DailyTotals (
  OrderDate           date PRIMARY KEY,
  NumberOfValues      bigint NOT NULL,
  SumOfValues         double precision NOT NULL,
  SumOfSquareOfValues double precision NOT NULL  -- types are illustrative
);
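
With n = NumberOfValues, S = SumOfValues, and Q = SumOfSquareOfValues for a window, the sample standard deviation follows from the standard identity

$$ s = \sqrt{\frac{Q - S^2/n}{n - 1}} $$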

Since the work table is keyed by date, your desired sliding window over the past 30 days of creation dates can be achieved. Since each stored value is itself a sum, summing again across dates is mathematically sound. Yes, there is still a calculation at runtime. It is much lighter than the full standard deviation one, however.
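
As a sketch of that runtime step, assuming the DailyTotals definition above (the 30-day window anchored at CURRENT_DATE is illustrative):

-- Reconstruct mean and sample stddev over the window from the
-- three running sums, using the identity above.
SELECT SUM(NumberOfValues)                       AS n,
    SUM(SumOfValues) / SUM(NumberOfValues)       AS mean,
    SQRT( (SUM(SumOfSquareOfValues)
           - SUM(SumOfValues) ^ 2 / SUM(NumberOfValues))
          / (SUM(NumberOfValues) - 1) )          AS stddev
FROM DailyTotals
WHERE OrderDate >= CURRENT_DATE - 30;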

When new values arrive, the work table can be updated synchronously (it's a 1-row update), asynchronously, or in batch, depending on the application's appetite for stale data.
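
A sketch of the synchronous path (the :d and :v placeholders are illustrative bind parameters for the day and its new value; PostgreSQL 9.3 predates INSERT ... ON CONFLICT, so the update-then-insert pair should run inside a transaction to be safe under concurrency):

-- Fold one new value :v for day :d into the running totals.
UPDATE DailyTotals
   SET NumberOfValues      = NumberOfValues + 1,
       SumOfValues         = SumOfValues + :v,
       SumOfSquareOfValues = SumOfSquareOfValues + :v * :v
 WHERE OrderDate = :d;

-- First value for that day: create the row instead.
INSERT INTO DailyTotals (OrderDate, NumberOfValues, SumOfValues, SumOfSquareOfValues)
SELECT :d, 1, :v, :v * :v
WHERE NOT EXISTS (SELECT 1 FROM DailyTotals WHERE OrderDate = :d);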