Postgresql – How to optimize the transaction level running balances cartesian join

join; postgresql; set-returning-functions; window-functions

This is a continuation of a previous question, where the table definition and sample data can be found:

Generate multiple running totals with GROUP BY day

(Huge thanks to @Erwin Brandstetter for the help there.)

All of this is being done on a PostgreSQL 11.5 DB.

I am trying to find ways to optimize the giant Cartesian join that seems necessary in the query below:

SELECT c.customer_id, d.the_day
 , sum(t.tsla_amount) OVER w AS tsla_running_amount
 , sum(t.goog_amount) OVER w AS goog_running_amount
FROM   (
   SELECT the_day::date
   FROM   generate_series(timestamp '2019-01-01'
                    , date_trunc('day', localtimestamp)
                    , interval '1 day') the_day
   ) d 
CROSS  JOIN (SELECT DISTINCT customer_id FROM transactions) c  -- !
LEFT   JOIN (
   SELECT customer_id
        , created_at::date AS the_day
        , sum(t.amount) FILTER (WHERE stock_ticker = 'tsla') AS tsla_amount
        , sum(t.amount) FILTER (WHERE stock_ticker = 'goog') AS goog_amount
   FROM   transactions t
   WHERE  created_at >= timestamp '2019-01-01'
   GROUP  BY customer_id, created_at::date
   ) t USING (customer_id, the_day)
WINDOW w AS (PARTITION BY customer_id ORDER BY the_day)
ORDER  BY customer_id, the_day;

While this query works, my end goal is slightly different: instead of every single customer having an entry on each day, I want only customers who have transacted on or before a given day to be included in the report for that day. As soon as a "new" customer makes their first transaction, they should be included from that day forward.

(Currently this query creates rows for each customer even on past days before their first transaction, and defaults those values to 0.)

CUSTOMER DDL:

CREATE TABLE customers (
customer_id varchar(255) NOT NULL,
city varchar(255) NULL,
state varchar(255) NULL,
postal_code varchar(255) NULL,
inserted_at timestamp NOT NULL,
updated_at timestamp NOT NULL,
CONSTRAINT customers_pkey PRIMARY KEY (customer_id)
);


TRANSACTION DDL:

CREATE TABLE public.transactions (
transaction_id varchar(255) NOT NULL,
amount float8 NOT NULL,
stock_ticker varchar(255) NOT NULL,
transaction_type varchar(255) NOT NULL,
customer_id varchar NOT NULL,
inserted_at timestamp NOT NULL,
created_at timestamp NOT NULL,
CONSTRAINT transactions_pkey PRIMARY KEY (transaction_id)
);

Example:
Old customers have been making stock transactions from 01-01-2020 forward.
A new customer makes their 1st stock transaction on 01-03-2020 (purchasing 2 goog).

The report generated in this case for these customers would have an entry for each day, for each customer showing a running balance of their stock transactions:

01-01-2020 – old_123_cust_id – 5 – tsla
01-01-2020 – old_234_cust_id – 10 – goog

01-02-2020 – old_123_cust_id – 5 – tsla
01-02-2020 – old_234_cust_id – 10 – goog

01-03-2020 – old_123_cust_id – 5 – tsla
01-03-2020 – old_234_cust_id – 10 – goog
01-03-2020 – NEW_567_cust_id – 2 – goog

01-04-2020 – old_123_cust_id – 5 – tsla
01-04-2020 – old_234_cust_id – 10 – goog
01-04-2020 – NEW_567_cust_id – 2 – goog

In this example, NEW_567_cust_id does not appear in the report until their 1st transaction; from then on, that customer continues to appear in the report even if they make no additional transactions.

Additional Information:
Estimated transactions row count: 300,000
Estimated customers row count: 45,000
Earliest transaction: 01-01-2019
Range: 01-01-2019 -> Today() (When query runs)
The day of a customer's 1st transaction will never change

Happy to provide any additional information!

Best Answer

You can get the day of the first transaction for every customer dynamically. With proper indexing, this is only half painful. There is ongoing work to include "index skip scan" in the next version of Postgres (13 as of this writing) to make it less painful. See:
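A minimal sketch of the dynamic lookup, assuming an index on transactions (customer_id, created_at) to keep it reasonably fast:

```sql
-- Sketch: first transaction day per customer, fetched dynamically.
-- Assumes an index on transactions (customer_id, created_at).
SELECT DISTINCT ON (customer_id)
       customer_id, created_at::date AS first_transaction_day
FROM   transactions
ORDER  BY customer_id, created_at;
```

DISTINCT ON picks the first row per customer_id in the given sort order; without index skip scan, Postgres still has to walk the index, which is where the "half painful" part comes from.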

But it would be simpler and cheaper to save that information. It would seem that the date of the first transaction per customer can never change later, much like a birthday. (And you confirmed as much in an update.) Let's call it first_transaction_at to fit your current schema.

For customers that have yet to make their first transaction (first_transaction_at IS NULL), you can still check dynamically, or update the table each time before you run the query.
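Such a maintenance step might look like this (a sketch; it only touches rows that are still NULL):

```sql
-- Sketch: backfill first_transaction_at for customers whose
-- first transaction happened since the last run.
UPDATE customers c
SET    first_transaction_at = t.first_at
FROM  (
   SELECT customer_id, min(created_at) AS first_at
   FROM   transactions
   GROUP  BY customer_id
   ) t
WHERE  t.customer_id = c.customer_id
AND    c.first_transaction_at IS NULL;
```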

Or if you can't or won't change the table definition you might add a MATERIALIZED VIEW for just that purpose: to keep track of first_transaction_at per customer.
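Such a view might look like this (the name customer_first_transaction is made up for the example):

```sql
CREATE MATERIALIZED VIEW customer_first_transaction AS
SELECT customer_id, min(created_at) AS first_transaction_at
FROM   transactions
GROUP  BY customer_id;

-- A unique index makes lookups fast and is required for
-- REFRESH ... CONCURRENTLY:
CREATE UNIQUE INDEX ON customer_first_transaction (customer_id);

-- Refresh as needed, e.g. right before running the report:
REFRESH MATERIALIZED VIEW CONCURRENTLY customer_first_transaction;
```

The report query below would then read first_transaction_at from this view instead of from customers.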

The updated table can look like:

CREATE TABLE customers (
   customer_id varchar(255) PRIMARY KEY,
   city varchar(255) NULL,
   state varchar(255) NULL,
   postal_code varchar(255) NULL,
   first_transaction_at timestamp,   -- can be NULL !
   inserted_at timestamp NOT NULL,
   updated_at timestamp NOT NULL
);

Then you can use a query like this:

SELECT c.customer_id, d.the_day
     , sum(t.tsla_amount) OVER w AS tsla_running_amount
     , sum(t.goog_amount) OVER w AS goog_running_amount
FROM   customers c  -- !
CROSS  JOIN LATERAL (
   SELECT the_day::date
   FROM   generate_series(first_transaction_at  -- !
                        , date_trunc('day', localtimestamp)
                        , interval '1 day') the_day
   ) d 
LEFT   JOIN (
   SELECT customer_id
        , created_at::date AS the_day
        , sum(t.amount) FILTER (WHERE stock_ticker = 'tsla') AS tsla_amount
        , sum(t.amount) FILTER (WHERE stock_ticker = 'goog') AS goog_amount
   FROM   transactions t
   WHERE  created_at >= timestamp '2019-01-01'
   GROUP  BY customer_id, created_at::date
   ) t USING (customer_id, the_day)
WINDOW w AS (PARTITION BY customer_id ORDER BY the_day)
ORDER  BY customer_id, the_day;

db<>fiddle here

The point being that you don't use the complete range of days for every customer, but a customized range, generated in the LATERAL subquery.

The CROSS JOIN LATERAL removes customers with first_transaction_at IS NULL from the result, as no rows are produced in the subquery for them.

Aside 1: The data type varchar(255) is typically a misunderstanding in Postgres - and a suboptimal idea for a PRIMARY KEY. See:

Aside 2: Consider timestamptz instead of timestamp. See: