PostgreSQL – Resolving Slow Queries Related to Subqueries Using Aggregation

Tags: aggregate, performance, postgresql, postgresql-9.2

Here's a simplified version of the queries I am running:

select
    ...
from
    editions
    join edition_events on edition_events.edition_id = editions.id
    join events on events.id = edition_events.event_id
    join (
        select
            event_id,
            array_agg(date_start) as date_starts,
            array_agg(date_end) as date_ends
        from
            event_dates
        group by
            event_id
        ) as dates on dates.event_id = events.id
where
    editions.id = ?;

These queries ran great with my test data (~1500 events). Now that I'm starting to use a copy of the real data (~400,000 events, ~500,000 event_dates), my application is timing out before I get results back from the database. I have added indexes on all relevant columns, which only helped a little bit. I've narrowed down the problem to the subqueries that use aggregation.

Now the real version of these subqueries does things like time zone manipulation or a count instead of array_agg, so using a trigger to update a counter column is not practical in many instances. The dates associated with events must be individual rows rather than stored as an array since they have a foreign key relationship with other tables.
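
For reference, here's a rough sketch of the relevant schema (simplified and with approximate types; the real tables have more columns, and I've left out the tables that reference event_dates):

create table editions (
    id serial primary key
    -- more columns ...
);

create table events (
    id serial primary key
    -- more columns ...
);

create table edition_events (
    edition_id integer not null references editions (id),
    event_id   integer not null references events (id),
    primary key (edition_id, event_id)
);

create table event_dates (
    id         serial primary key,
    event_id   integer not null references events (id),
    date_start timestamptz,
    date_end   timestamptz
);

-- indexes on the join columns
create index on edition_events (event_id);
create index on event_dates (event_id);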

What would be my best bet for improving performance here? I understand that the subquery is being evaluated first, but I don't know which dates I actually need to look at until I've done the joins. I have looked at lateral joins, but I am uncertain if they would help here (and most information I can find is along the lines of "look, new feature!"). Since I am still using 9.2, I don't want to ask to upgrade to 9.3+ if it won't actually solve my problem.

Best Answer

Your approach ("aggregate first, join later") generally performs best when you need to retrieve the whole table or large parts of it, when you can cheaply restrict the subquery to the relevant rows, or when the aggregated table is small.

For only a few rows in the result, this gets disproportionately expensive with big tables, since the whole table (event_dates in your case) has to be processed. In such cases it's better to invert the approach to "join first, aggregate later". The equivalent query would then be:

SELECT e.id
     , array_agg(ed.date_start) AS starts
     , array_agg(ed.date_end)   AS ends
     , ...
FROM   editions       e
JOIN   edition_events ee ON ee.edition_id = e.id
JOIN   events         ev ON ev.id = ee.event_id
JOIN   event_dates    ed ON ed.event_id = ev.id
WHERE  e.id = ?
GROUP  BY e.id;

This should help with any version of Postgres. I also used table aliases to simplify the syntax.

I don't see how LATERAL (Postgres 9.3+) could help this particular query. You would use it to reference a column from the left side of the join tree that you couldn't access on the same query level otherwise, often to produce multiple rows to join to. Neither is the case here. What you are thinking of can be implemented with plain old correlated subqueries, but those typically perform worse.
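
For illustration, the correlated-subquery variant would look something like this (just a sketch against the simplified tables from the question):

SELECT ev.id
     , (SELECT array_agg(ed.date_start)
        FROM   event_dates ed
        WHERE  ed.event_id = ev.id) AS starts
     , (SELECT array_agg(ed.date_end)
        FROM   event_dates ed
        WHERE  ed.event_id = ev.id) AS ends
FROM   editions       e
JOIN   edition_events ee ON ee.edition_id = e.id
JOIN   events         ev ON ev.id = ee.event_id
WHERE  e.id = ?;

Each scalar subquery hits event_dates once per event, which is usually why this loses to the plain join-then-aggregate query above.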
Don't take my word for it. Run a quick test with EXPLAIN ANALYZE for your case (with your real-life data); then you'll know for sure.
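
For example (substitute a real edition id for the placeholder; plans and timings depend entirely on your data and settings):

EXPLAIN (ANALYZE, BUFFERS)
SELECT e.id
     , array_agg(ed.date_start) AS starts
     , array_agg(ed.date_end)   AS ends
FROM   editions       e
JOIN   edition_events ee ON ee.edition_id = e.id
JOIN   events         ev ON ev.id = ee.event_id
JOIN   event_dates    ed ON ed.event_id = ev.id
WHERE  e.id = 12345          -- concrete id instead of the ? placeholder
GROUP  BY e.id;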