Postgresql – Optimizing PostgreSQL query with multiple subqueries

postgresql

I have a rather simple question (I think): I have one query with 2 window functions: COUNT() and AVG(). I also have 10 subqueries, each with a window function (COUNT() or AVG()).

Testing in pgAdmin, this query runs fast. The problem is that I'm currently testing against only 100 rows.

Here is the query:

SELECT
    COUNT(a.review_id)::int AS total_reviews,
    COALESCE(AVG(a.score), 0)::int AS average_score,
    (SELECT COUNT(review_id) FROM data.users_pro_reviews WHERE user_id = $1 AND score = 1)::int AS total_1,
    (SELECT COUNT(review_id) FROM data.users_pro_reviews WHERE user_id = $1 AND score = 2)::int AS total_2,
    (SELECT COUNT(review_id) FROM data.users_pro_reviews WHERE user_id = $1 AND score = 3)::int AS total_3,
    (SELECT COUNT(review_id) FROM data.users_pro_reviews WHERE user_id = $1 AND score = 4)::int AS total_4,
    (SELECT COUNT(review_id) FROM data.users_pro_reviews WHERE user_id = $1 AND score = 5)::int AS total_5,
    COALESCE((SELECT AVG(score) FROM data.users_pro_reviews WHERE user_id = $1 AND timestamp_reviewed < NOW()), 0)::int AS today,
    COALESCE((SELECT AVG(score) FROM data.users_pro_reviews WHERE user_id = $1 AND timestamp_reviewed < NOW() - INTERVAL '1 DAY'), 0)::int AS yesterday,
    COALESCE((SELECT AVG(score) FROM data.users_pro_reviews WHERE user_id = $1 AND timestamp_reviewed < NOW() - INTERVAL '1 WEEK'), 0)::int AS one_week_ago,
    COALESCE((SELECT AVG(score) FROM data.users_pro_reviews WHERE user_id = $1 AND timestamp_reviewed < NOW() - INTERVAL '1 MONTH'), 0)::int AS one_month_ago,
    COALESCE((SELECT AVG(score) FROM data.users_pro_reviews WHERE user_id = $1 AND timestamp_reviewed < NOW() - INTERVAL '6 MONTH'), 0)::int AS six_months_ago
FROM
    data.users_pro_reviews a
WHERE
    a.user_id = $1

How can this query be optimized? By using JOINs? And how should I use window functions in JOINs?

Best Answer

It's probably more performant to write it this way instead of querying the same table multiple times:

SELECT
    COUNT(a.review_id)::int AS total_reviews,
    COALESCE(AVG(a.score), 0)::int AS average_score,
    SUM(CASE WHEN score = 1 THEN 1 ELSE 0 END)::int AS total_1,
    SUM(CASE WHEN score = 2 THEN 1 ELSE 0 END)::int AS total_2,
    SUM(CASE WHEN score = 3 THEN 1 ELSE 0 END)::int AS total_3,
    SUM(CASE WHEN score = 4 THEN 1 ELSE 0 END)::int AS total_4,
    SUM(CASE WHEN score = 5 THEN 1 ELSE 0 END)::int AS total_5
/* ...similar syntax for your AVG calculations */    
FROM
    data.users_pro_reviews a
WHERE
    a.user_id = $1

Using the SUM function with a well-crafted CASE expression is logically equivalent to the COUNT subqueries you had before, but it reads the table only once. You can apply similar logic to the AVG function (I'll probably update this answer with an example of that).
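As a minimal sketch of what the AVG variant might look like, using only the column names and aliases from the question: a CASE without an ELSE branch returns NULL for non-matching rows, and AVG ignores NULLs, so each expression averages only the rows that meet its condition. Note that adding ELSE 0 here (as in the SUM version) would be wrong, because the zeros would be counted in the average.

```sql
-- Sketch only: conditional averages computed in the same single pass.
-- No ELSE branch on purpose: non-matching rows become NULL and AVG skips them.
SELECT
    COUNT(a.review_id)::int AS total_reviews,
    COALESCE(AVG(a.score), 0)::int AS average_score,
    COALESCE(AVG(CASE WHEN a.timestamp_reviewed < NOW() THEN a.score END), 0)::int AS today,
    COALESCE(AVG(CASE WHEN a.timestamp_reviewed < NOW() - INTERVAL '1 DAY' THEN a.score END), 0)::int AS yesterday,
    COALESCE(AVG(CASE WHEN a.timestamp_reviewed < NOW() - INTERVAL '1 WEEK' THEN a.score END), 0)::int AS one_week_ago,
    COALESCE(AVG(CASE WHEN a.timestamp_reviewed < NOW() - INTERVAL '1 MONTH' THEN a.score END), 0)::int AS one_month_ago,
    COALESCE(AVG(CASE WHEN a.timestamp_reviewed < NOW() - INTERVAL '6 MONTH' THEN a.score END), 0)::int AS six_months_ago
FROM
    data.users_pro_reviews a
WHERE
    a.user_id = $1
```

The COALESCE wrappers are kept from the question so that a period with no matching reviews yields 0 instead of NULL.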

As per @LaurenzAlbe, this is another example you can use leveraging the FILTER keyword:

SELECT
    COUNT(a.review_id)::int AS total_reviews,
    COALESCE(AVG(a.score), 0)::int AS average_score,
    COUNT(*) FILTER (WHERE score = 1)::int AS total_1,
    COUNT(*) FILTER (WHERE score = 2)::int AS total_2,
    COUNT(*) FILTER (WHERE score = 3)::int AS total_3,
    COUNT(*) FILTER (WHERE score = 4)::int AS total_4,
    COUNT(*) FILTER (WHERE score = 5)::int AS total_5
/* ...similar syntax for your AVG calculations */    
FROM
    data.users_pro_reviews a
WHERE
    a.user_id = $1

I do have to agree this syntax feels a lot cleaner.
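For completeness, here is a sketch of how the whole query from the question might look with FILTER applied to both the counts and the averages. It assumes PostgreSQL 9.4 or later, where the FILTER clause is available, and reuses the column names and aliases from the question.

```sql
-- Sketch only: every aggregate is evaluated in one scan of the user's rows.
-- FILTER restricts which rows feed each aggregate, so no subqueries are needed.
SELECT
    COUNT(a.review_id)::int AS total_reviews,
    COALESCE(AVG(a.score), 0)::int AS average_score,
    COUNT(*) FILTER (WHERE a.score = 1)::int AS total_1,
    COUNT(*) FILTER (WHERE a.score = 2)::int AS total_2,
    COUNT(*) FILTER (WHERE a.score = 3)::int AS total_3,
    COUNT(*) FILTER (WHERE a.score = 4)::int AS total_4,
    COUNT(*) FILTER (WHERE a.score = 5)::int AS total_5,
    COALESCE(AVG(a.score) FILTER (WHERE a.timestamp_reviewed < NOW()), 0)::int AS today,
    COALESCE(AVG(a.score) FILTER (WHERE a.timestamp_reviewed < NOW() - INTERVAL '1 DAY'), 0)::int AS yesterday,
    COALESCE(AVG(a.score) FILTER (WHERE a.timestamp_reviewed < NOW() - INTERVAL '1 WEEK'), 0)::int AS one_week_ago,
    COALESCE(AVG(a.score) FILTER (WHERE a.timestamp_reviewed < NOW() - INTERVAL '1 MONTH'), 0)::int AS one_month_ago,
    COALESCE(AVG(a.score) FILTER (WHERE a.timestamp_reviewed < NOW() - INTERVAL '6 MONTH'), 0)::int AS six_months_ago
FROM
    data.users_pro_reviews a
WHERE
    a.user_id = $1
```

An aggregate whose FILTER matches no rows returns NULL for AVG (hence the COALESCE) and 0 for COUNT.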

Related Solutions

PostgreSQL – Count Data Using Multiple Date Ranges

select  industry_id
,       sum(case when current_date <= date then clicks end) as today 
,       sum(case when current_date-1 <= date and
                      date < current_date then clicks end) as yesterday
,       sum(case when current_date-4 <= date and 
                      date < current_date-1 then clicks end) as last3days
from    phone_clicks
group by
        industry_id

See it in your SQLFiddle.

PostgreSQL – How to Select Rolling Most Recent Value for Each Person for Each Month

2015-11-22 is not older than 6 months compared to 2016-05-01 unless you also truncate the response date to the month.

> select '2015-11-22'::date + interval '6 months';
+---------------------+
| ?column?            |
|---------------------|
| 2016-05-22 00:00:00 |
+---------------------+

> select date_trunc('month', '2015-11-22'::date) + interval '6 months';
+---------------------------+
| ?column?                  |
|---------------------------|
| 2016-05-01 00:00:00+02:00 |
+---------------------------+

So, depending on what you consider expired, the query changes a bit. I've included both versions: the proper 6-month date condition is commented out, and the active condition follows the example.

Using a LATERAL join:

with users as (
    -- I assume you have user table so this can be omitted
    -- first and last are used to limit the join
    select user_id, min(date) as first, max(date) as last from responses group by user_id
), boundaries as (
    select 
        date_trunc('month', min(date)) as low 
        --, date_trunc('month', max(date)) as high 
        -- If you want to use the high in generate_series as upper boundary
    from responses
)
select user_id, tick as "date", score 
from 
    generate_series(
        (select low from boundaries), 
        date_trunc('month', current_date + interval '1 month'), interval '1 month'
    ) as tick
join users on tick.tick between users.first and users.last + interval '6 months'
-- # If you omit the users CTE and you have a users table
-- cross join users
left outer join lateral (
    select score from responses 
    where users.user_id = responses.user_id 
    -- # Proper 6 months calculation
    --and responses.date between tick.tick - interval '6 months' and tick.tick 
    -- # As example showed
    and responses.date < tick.tick and date_trunc('month', responses.date) + interval '6 months' > tick.tick 
    order by responses.date desc limit 1
) a on true
order by 1, 2;

Using window functions:

with boundaries as (
    select 
        date_trunc('month', min(date)) as low
    from responses
)
select distinct responses.user_id, tick as "date", first_value(score) over (partition by responses.user_id, tick.tick order by responses.date desc)
from 
    generate_series(
        (select low from boundaries), 
        date_trunc('month', current_date + interval '1 month'), interval '1 month'
    ) as tick
join responses on (
    -- # Proper 6 months calculation
    -- responses.date between tick.tick - interval '6 months' and tick.tick 
    -- # As it was in the example
    responses.date < tick.tick and 
    date_trunc('month', responses.date) > tick.tick - interval '6 months'
) 
order by 1, 2;
