My database consists of apps and their reviews (schema below). I'm trying to answer the following question:
Given a series of dates from the earliest reviews.review_date
to the latest reviews.review_date
(incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?
This is the query that I've come up with to try and answer that question:
select
review_windows.review_window_start,
id,
slug,
review_total,
earliest_review
from
(
select
date_trunc('day', review_windows.review_windows) :: date as review_window_start
from
generate_series(
(
SELECT
min(reviews.review_date)
FROM
reviews
),
(
SELECT
max(reviews.review_date)
FROM
reviews
),
'1 day'
) review_windows
order by
1 desc
) review_windows
left join lateral (
SELECT
apps.id,
apps.slug,
count(reviews.*) as review_total,
min(reviews.review_date) as earliest_review
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
where
reviews.review_date >= review_windows.review_window_start
group by
1,
2
having
min(reviews.review_date) >= review_windows.review_window_start
order by
3 desc,
4 desc
limit
2
) apps_most_reviews on true;
It is extremely slow and I'm not sure why. If I want any kinds of results I use week
instead of day
in the generate_series
call and even then that might take a minute or even longer.
Where should I start when debugging a performance issue like this?
Visualized query plan here
There are ~5K rows in apps
and ~400K rows in reviews
so it's a mystery to me why this is taking so long.
Running the individual subquery that is run for each entry in the lateral join given a single date only takes 161 ms
(below) and the subquery for generate_series
only takes 4 ms
. I'm clearly doing something very wrong. Any help would be much appreciated!
Individual subquery with an explicit date
SELECT
apps.id,
apps.slug,
count(reviews.*) as review_total,
min(reviews.review_date) as earliest_review
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
where
reviews.review_date >= '2018-04-17'::date
group by
1,
2
having
min(reviews.review_date) >= '2018-04-17'::date
order by
3 desc,
4 desc
limit
2
Tables
apps
Schema
| | column_name | data_type | is_nullable | foreign_key |
|---|-------------|--------------|-------------|-------------|
| 1 | id | int4 | NO | |
| 2 | name | varchar(255) | NO | |
| 3 | slug | varchar(255) | NO | |
Indexes
| index_name | index_algorithm | is_unique | column_name |
|-----------------|-----------------|-----------|-------------|
| apps_slug_index | BTREE | t | slug |
| apps_pkey | BTREE | t | id |
reviews
Schema
| | column_name | data_type | is_nullable | foreign_key |
|---|---------------|--------------|-------------|-----------------|
| 1 | id | int4 | NO | |
| 2 | rating | int4 | NO | |
| 3 | review_date | date | NO | |
| 4 | reviewer_name | varchar(255) | NO | |
| 5 | review_body | text | NO | |
| 6 | app_id | int4 | NO | public.apps(id) |
Indexes
| index_name | index_algorithm | is_unique | column_name |
|-----------------------------|-----------------|-----------|---------------|
| reviews_reviewer_name_index | BTREE | f | reviewer_name |
| reviews_review_date_index | BTREE | f | review_date |
| reviews_pkey | BTREE | t | id |
| reviews_app_id_index | BTREE | f | app_id |
Best Answer
The main reason for the slowness is that you aggregate over the big table from scratch for every iteration of the lateral sibquery. Compute earliest review & current total count per app in a CTE once and base the lateral subquery on it. I discussed that and some other optizations under your predating related question on SO:
One difference: In this question you also return an additional attribute of the app (
slug
), so we need to join to tableapps
after all. Literally: join toapps
after aggregatingreviews
. That's cheaper.Another difference:
earliest_review
as additioal tiebreaker. That's just a gradual improvement, though, as there can stiill be 0-n winners perreview_window_start
. Instead of picking the (arbitrary) top two (limit 2
), select the first one, with ties. (Wording already hints at the new technique in Postgres 13; see below.)Like in the related answer on SO, it will be cheaper yet in Postgres 13 using
WITH TIES
. See: