Postgresql – How to optimize the transaction level running balances cartesian join

join; postgresql; set-returning-functions; window-functions

This is a continuation of a previous question, where the table definition and sample data can be found:

Generate multiple running totals with GROUP BY day

(Huge thanks to @Erwin Brandstetter for the help there.)

All of this is being done on a PostgreSQL 11.5 DB.

I am trying to find ways to optimize the giant Cartesian join that seems necessary in the query below:

SELECT c.customer_id, d.the_day
 , sum(t.tsla_amount) OVER w AS tsla_running_amount
 , sum(t.goog_amount) OVER w AS goog_running_amount
FROM   (
   SELECT the_day::date
   FROM   generate_series(timestamp '2019-01-01'
                    , date_trunc('day', localtimestamp)
                    , interval '1 day') the_day
   ) d 
CROSS  JOIN (SELECT DISTINCT customer_id FROM transactions) c  -- !
LEFT   JOIN (
   SELECT customer_id
        , created_at::date AS the_day
        , sum(t.amount) FILTER (WHERE stock_ticker = 'tsla') AS tsla_amount
        , sum(t.amount) FILTER (WHERE stock_ticker = 'goog') AS goog_amount
   FROM   transactions t
   WHERE  created_at >= timestamp '2019-01-01'
   GROUP  BY customer_id, created_at::date
   ) t USING (customer_id, the_day)
WINDOW w AS (PARTITION BY customer_id ORDER BY the_day)
ORDER  BY customer_id, the_day;

While this query works, my end goal is slightly different: instead of every single customer having an entry on each day, I want only customers who have transacted on or before a given day to be included in the report for that day. As soon as a "new" customer makes their first transaction, they should be included from that day forward.

(Currently this query creates rows for each customer even on past days before their first transaction, and defaults those values to 0.)

CUSTOMER DDL:

CREATE TABLE customers (
customer_id varchar(255) NOT NULL,
city varchar(255) NULL,
state varchar(255) NULL,
postal_code varchar(255) NULL,
inserted_at timestamp NOT NULL,
updated_at timestamp NOT NULL,
CONSTRAINT customers_pkey PRIMARY KEY (customer_id)
);


TRANSACTION DDL:

CREATE TABLE public.transactions (
transaction_id varchar(255) NOT NULL,
amount float8 NOT NULL,
stock_ticker varchar(255) NOT NULL,
transaction_type varchar(255) NOT NULL,
customer_id varchar NOT NULL,
inserted_at timestamp NOT NULL,
created_at timestamp NOT NULL,
CONSTRAINT transactions_pkey PRIMARY KEY (transaction_id)
);

Example:
Old customers have been making stock transactions from 01-01-2020 forward.
A new customer makes their 1st stock transaction on 01-03-2020 (purchasing 2 goog).

The report generated in this case for these customers would have an entry for each day, for each customer showing a running balance of their stock transactions:

01-01-2020 – old_123_cust_id – 5 – tsla
01-01-2020 – old_234_cust_id – 10 – goog

01-02-2020 – old_123_cust_id – 5 – tsla
01-02-2020 – old_234_cust_id – 10 – goog

01-03-2020 – old_123_cust_id – 5 – tsla
01-03-2020 – old_234_cust_id – 10 – goog
01-03-2020 – NEW_567_cust_id – 2 – goog

01-04-2020 – old_123_cust_id – 5 – tsla
01-04-2020 – old_234_cust_id – 10 – goog
01-04-2020 – NEW_567_cust_id – 2 – goog

In this example, NEW_567_cust_id does not appear in the report until their 1st transaction; from then on, that customer continues to appear in the report even if they make no additional transactions.

Additional Information:
Estimated transactions row count: 300,000
Estimated customers row count: 45,000
Earliest transaction: 01-01-2019
Range: 01-01-2019 -> Today() (When query runs)
The day of a customer's 1st transaction will never change

Happy to provide any additional information!

Best Answer

You can get the day of the first transaction for every customer dynamically. With proper indexing, this is only half painful. There is ongoing work to include "index skip scan" in the next version of Postgres (13 as of this writing) to make it less painful. See:
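A minimal sketch of the dynamic lookup, assuming an index on transactions (customer_id, created_at) to keep it reasonably fast:

```sql
-- Sketch: first transaction day per customer, fetched dynamically.
-- Assumes an index on transactions (customer_id, created_at).
SELECT DISTINCT ON (customer_id)
       customer_id, created_at::date AS first_transaction_day
FROM   transactions
ORDER  BY customer_id, created_at;
```

DISTINCT ON picks the first row per customer_id in the given sort order; without index skip scan, Postgres still has to walk the index, which is where the "half painful" part comes from.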

But it would be simpler and cheaper to save that information. It would seem that the date of the first transaction per customer can never change later, much like a birthday. (And you confirmed as much in an update.) Let's call it first_transaction_at to fit your current schema.

For customers that have yet to make their first transaction (first_transaction_at IS NULL), you can still check dynamically, or update the table each time before you run the query.
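Such a maintenance step might look like this (a sketch; it only touches rows that are still NULL):

```sql
-- Sketch: backfill first_transaction_at for customers whose
-- first transaction happened since the last run.
UPDATE customers c
SET    first_transaction_at = t.first_at
FROM  (
   SELECT customer_id, min(created_at) AS first_at
   FROM   transactions
   GROUP  BY customer_id
   ) t
WHERE  t.customer_id = c.customer_id
AND    c.first_transaction_at IS NULL;
```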

Or if you can't or won't change the table definition you might add a MATERIALIZED VIEW for just that purpose: to keep track of first_transaction_at per customer.
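Such a view might look like this (the name customer_first_transaction is made up for the example):

```sql
CREATE MATERIALIZED VIEW customer_first_transaction AS
SELECT customer_id, min(created_at) AS first_transaction_at
FROM   transactions
GROUP  BY customer_id;

-- A unique index makes lookups fast and is required for
-- REFRESH ... CONCURRENTLY:
CREATE UNIQUE INDEX ON customer_first_transaction (customer_id);

-- Refresh as needed, e.g. right before running the report:
REFRESH MATERIALIZED VIEW CONCURRENTLY customer_first_transaction;
```

The report query below would then read first_transaction_at from this view instead of from customers.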

The updated table can look like:

CREATE TABLE customers (
   customer_id varchar(255) PRIMARY KEY,
   city varchar(255) NULL,
   state varchar(255) NULL,
   postal_code varchar(255) NULL,
   first_transaction_at timestamp,   -- can be NULL !
   inserted_at timestamp NOT NULL,
   updated_at timestamp NOT NULL
);

Then you can use a query like this:

SELECT c.customer_id, d.the_day
     , sum(t.tsla_amount) OVER w AS tsla_running_amount
     , sum(t.goog_amount) OVER w AS goog_running_amount
FROM   customers c  -- !
CROSS  JOIN LATERAL (
   SELECT the_day::date
   FROM   generate_series(first_transaction_at  -- !
                        , date_trunc('day', localtimestamp)
                        , interval '1 day') the_day
   ) d 
LEFT   JOIN (
   SELECT customer_id
        , created_at::date AS the_day
        , sum(t.amount) FILTER (WHERE stock_ticker = 'tsla') AS tsla_amount
        , sum(t.amount) FILTER (WHERE stock_ticker = 'goog') AS goog_amount
   FROM   transactions t
   WHERE  created_at >= timestamp '2019-01-01'
   GROUP  BY customer_id, created_at::date
   ) t USING (customer_id, the_day)
WINDOW w AS (PARTITION BY customer_id ORDER BY the_day)
ORDER  BY customer_id, the_day;

db<>fiddle here

The point being that you don't use the complete range of days for every customer, but a customized range, generated in the LATERAL subquery.

The CROSS JOIN LATERAL removes customers with first_transaction_at IS NULL from the result, as no rows are produced in the subquery for them.

Aside 1: The data type varchar(255) is typically a misunderstanding in Postgres - and a suboptimal idea for a PRIMARY KEY. See:

Aside 2: Consider timestamptz instead of timestamp. See: