Fixing Slow Lateral Join on Small PostgreSQL Database

postgresqlpostgresql-12query-performance

My database consists of apps and their reviews (schema below). I'm trying to answer the following question:

Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?

This is the query that I've come up with to try and answer that question:

select
  review_windows.review_window_start,
  id,
  slug,
  review_total,
  earliest_review
from
  (
    select
      date_trunc('day', review_windows.review_windows) :: date as review_window_start
    from
      generate_series(
        (
          SELECT
            min(reviews.review_date)
          FROM
            reviews
        ),
        (
          SELECT
            max(reviews.review_date)
          FROM
            reviews
        ),
        '1 day'
      ) review_windows
    order by
      1 desc
  ) review_windows
  left join lateral (
    SELECT
      apps.id,
      apps.slug,
      count(reviews.*) as review_total,
      min(reviews.review_date) as earliest_review
    FROM
      reviews
      INNER JOIN apps ON apps.id = reviews.app_id
    where
      reviews.review_date >= review_windows.review_window_start
    group by
      1,
      2
    having
      min(reviews.review_date) >= review_windows.review_window_start
    order by
      3 desc,
      4 desc
    limit
      2
  ) apps_most_reviews on true;

It is extremely slow and I'm not sure why. If I want any kinds of results I use week instead of day in the generate_series call and even then that might take a minute or even longer.

Where should I start when debugging a performance issue like this?

Visualized query plan here

There are ~5K rows in apps and ~400K rows in reviews so it's a mystery to me why this is taking so long.

Running the individual subquery that is run for each entry in the lateral join given a single date only takes 161 ms (below) and the subquery for generate_series only takes 4 ms. I'm clearly doing something very wrong. Any help would be much appreciated!

Individual subquery with an explicit date

SELECT
  apps.id,
  apps.slug,
  count(reviews.*) as review_total,
  min(reviews.review_date) as earliest_review
FROM
  reviews
  INNER JOIN apps ON apps.id = reviews.app_id
where
  reviews.review_date >= '2018-04-17'::date
group by
  1,
  2
having
  min(reviews.review_date) >= '2018-04-17'::date
order by
  3 desc,
  4 desc
limit
  2

Tables

`apps`

Schema

|   | column_name | data_type    | is_nullable | foreign_key |
|---|-------------|--------------|-------------|-------------|
| 1 | id          | int4         | NO          |             |
| 2 | name        | varchar(255) | NO          |             |
| 3 | slug        | varchar(255) | NO          |             |

Indexes

| index_name      | index_algorithm | is_unique | column_name |
|-----------------|-----------------|-----------|-------------|
| apps_slug_index | BTREE           | t         | slug        |
| apps_pkey       | BTREE           | t         | id          |

`reviews`

Schema

|   | column_name   | data_type    | is_nullable | foreign_key     |
|---|---------------|--------------|-------------|-----------------|
| 1 | id            | int4         | NO          |                 |
| 2 | rating        | int4         | NO          |                 |
| 3 | review_date   | date         | NO          |                 |
| 4 | reviewer_name | varchar(255) | NO          |                 |
| 5 | review_body   | text         | NO          |                 |
| 6 | app_id        | int4         | NO          | public.apps(id) |

Indexes

| index_name                  | index_algorithm | is_unique | column_name   |
|-----------------------------|-----------------|-----------|---------------|
| reviews_reviewer_name_index | BTREE           | f         | reviewer_name |
| reviews_review_date_index   | BTREE           | f         | review_date   |
| reviews_pkey                | BTREE           | t         | id            |
| reviews_app_id_index        | BTREE           | f         | app_id        |

Best Answer

The main reason for the slowness is that you aggregate over the big table from scratch for every iteration of the lateral sibquery. Compute earliest review & current total count per app in a CTE once and base the lateral subquery on it. I discussed that and some other optizations under your predating related question on SO:

Get apps with the highest review count since a dynamic series of days

One difference: In this question you also return an additional attribute of the app (slug), so we need to join to table apps after all. Literally: join to apps after aggregating reviews. That's cheaper.

Another difference: earliest_review as additioal tiebreaker. That's just a gradual improvement, though, as there can stiill be 0-n winners per review_window_start. Instead of picking the (arbitrary) top two ( limit 2), select the first one, with ties. (Wording already hints at the new technique in Postgres 13; see below.)

WITH cte AS MATERIALIZED (
   SELECT a.id, a.slug, r.earliest_review, r.review_total
   FROM   apps a
   JOIN    (
      SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS review_total
      FROM   reviews
      GROUP  BY 1
      ) r ON r.app_id = a.id
   )
SELECT *
FROM  (
   SELECT generate_series(min(review_date)
                        , max(review_date)
                        , '1 day')::date
   FROM   reviews
   ) d(review_window_start)
LEFT  JOIN LATERAL (
   SELECT app_id, slug, review_total, earliest_review
   FROM  (
      SELECT app_id, slug, review_total, earliest_review
            ,  rank() OVER (ORDER BY review_total DESC, earliest_review DESC) AS rnk
      FROM   cte c
      WHERE  c.earliest_review >= d.review_window_start
      ) sub
   WHERE  rnk = 1
   ) a ON true;

Like in the related answer on SO, it will be cheaper yet in Postgres 13 using WITH TIES. See:

Greater than or equal to ALL() and equal to MAX() speed

...
LEFT  JOIN LATERAL (
   SELECT app_id, slug, review_total, earliest_review
   FROM   cte c
   WHERE  earliest_review >= d.review_window_start
   ORDER  BY review_total DESC, earliest_review DESC
   FETCH  FIRST 1 ROWS WITH TIES  -- new!
   ) a ON true;

Related Solutions

SQL Server Performance – Slow SQL Script for String Manipulation

You have a performance question so I recommend you read How to analyse SQL Server performance to begin with.

As about your script: you're doing it as slowly as possible. Row by painfully slow row. Not even a cursor slow, is worse than a cursor slow. What you have here is a single update:

update original_test_dataset
set duplicate= case when PATINDEX('%dup%', id)=0 then 0 else 1 end,
 id_only = REPLACE((LEFT(REPLACE(id,'rec-',''), CHARINDEX('-',REPLACE(id,'rec-',''),0)),'-','');

That's it, nothing else.

What you have in the original script is slow because:

it operates on row by row as opposed to sets. SQL is designed to manipulate sets, not rows.
it perform a lot of writes in standalone commits resulting in huge flush commit waits. Batch commit should be used.
it performs multiple unindexed end-to-end scans on #ControlTable (every operation on #ControlTable is an unindexed end-to-end scan, it should had an index on id, yes #temp tables do need indexes)
it does two updates per row instead of one update of two fields
I can't know whether the access to original_test_dataset is in any way optimized, in lack of an index on id is likely is not.

declare @TableID varchar(20) vs. [id] [varchar](50) NULL: you're silently truncating data. Basically, your script corrupts the table.

PostgreSQL Performance – Fixing Very Slow JOIN Queries

Another good query variant is:

SELECT p.user_id, p.cnt AS PostCount
FROM users u
INNER JOIN (
    select user_id, count(id) as cnt from posts group by user_id
) as p on p.user_id = u.id
WHERE u.username IS NOT NULL          
ORDER BY PostCount DESC LIMIT 100;

It doesn't exploit CTE and gives correct answer (and CTE example may produce less than 100 rows in theory cause it first limits then joins with users).

I suppose, MSSQL able to perform such transformation in its query optimizer, and PostgreSQL is not able to push down aggregation under join. Or MSSQL just have much faster hash join implementation.