PostgreSQL – Resolving Slow Queries Related to Subqueries Using Aggregation

Tags: aggregate, performance, postgresql, postgresql-9.2

Here's a simplified version of the queries I am running:

select
    ...
from
    editions
    join edition_events on edition_events.edition_id = editions.id
    join events on events.id = edition_events.event_id
    join (
        select
            event_id,
            array_agg(date_start) as date_starts,
            array_agg(date_end) as date_ends
        from
            event_dates
        group by
            event_id
        ) as dates on dates.event_id = events.id
where
    editions.id = ?;

These queries ran great with my test data (~1500 events). Now that I'm starting to use a copy of the real data (~400,000 events, ~500,000 event_dates), my application is timing out before I get results back from the database. I have added indexes on all relevant columns, which only helped a little bit. I've narrowed down the problem to the subqueries that use aggregation.

Now the real version of these subqueries does things like time zone manipulation or a count instead of array_agg, so using a trigger to update a counter column is not practical in many instances. The dates associated with events must be individual rows rather than stored as an array since they have a foreign key relationship with other tables.
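
For reference, here's a rough sketch of the relevant schema (simplified and with approximate types; the real tables have more columns, and I've left out the tables that reference event_dates):

create table editions (
    id serial primary key
    -- more columns ...
);

create table events (
    id serial primary key
    -- more columns ...
);

create table edition_events (
    edition_id integer not null references editions (id),
    event_id   integer not null references events (id),
    primary key (edition_id, event_id)
);

create table event_dates (
    id         serial primary key,
    event_id   integer not null references events (id),
    date_start timestamptz,
    date_end   timestamptz
);

-- indexes on the join columns
create index on edition_events (event_id);
create index on event_dates (event_id);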

What would be my best bet for improving performance here? I understand that the subquery is being evaluated first, but I don't know which dates I actually need to look at until I've done the joins. I have looked at lateral joins, but I am uncertain if they would help here (and most information I can find is along the lines of "look, new feature!"). Since I am still using 9.2, I don't want to ask to upgrade to 9.3+ if it won't actually solve my problem.

Best Answer

Your approach ("aggregate first, join later") generally performs best when you need to retrieve the whole table or large parts of it, when you can cheaply restrict the subquery to the relevant rows, or when the aggregated table is small.

For only a few rows in the result, this gets disproportionately expensive with big tables, since the whole table (event_dates in your case) has to be processed. In such cases it's better to invert the approach to "join first, aggregate later". The equivalent query would then be:

SELECT e.id
     , array_agg(ed.date_start) AS starts
     , array_agg(ed.date_end)   AS ends
     , ...
FROM   editions       e
JOIN   edition_events ee ON ee.edition_id = e.id
JOIN   events         ev ON ev.id = ee.event_id
JOIN   event_dates    ed ON ed.event_id = ev.id
WHERE  e.id = ?
GROUP  BY e.id;

This should help with any version of Postgres. I also used table aliases to simplify the syntax.

I don't see how LATERAL (Postgres 9.3+) could help this particular query. You would use it to reference a column from the left side of the join tree that you couldn't access on the same query level otherwise, often to produce multiple rows to join to. Neither is the case here. What you are thinking of can be implemented with plain old correlated subqueries, but those typically perform worse.
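
For illustration, the correlated-subquery variant would look something like this (just a sketch against the simplified tables from the question):

SELECT ev.id
     , (SELECT array_agg(ed.date_start)
        FROM   event_dates ed
        WHERE  ed.event_id = ev.id) AS starts
     , (SELECT array_agg(ed.date_end)
        FROM   event_dates ed
        WHERE  ed.event_id = ev.id) AS ends
FROM   editions       e
JOIN   edition_events ee ON ee.edition_id = e.id
JOIN   events         ev ON ev.id = ee.event_id
WHERE  e.id = ?;

Each scalar subquery hits event_dates once per event, which is usually why this loses to the plain join-then-aggregate query above.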
Don't take my word for it. Run a quick test with EXPLAIN ANALYZE for your case (with your real-life data); then you'll know for sure.
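
For example (substitute a real edition id for the placeholder; plans and timings depend entirely on your data and settings):

EXPLAIN (ANALYZE, BUFFERS)
SELECT e.id
     , array_agg(ed.date_start) AS starts
     , array_agg(ed.date_end)   AS ends
FROM   editions       e
JOIN   edition_events ee ON ee.edition_id = e.id
JOIN   events         ev ON ev.id = ee.event_id
JOIN   event_dates    ed ON ed.event_id = ev.id
WHERE  e.id = 12345          -- concrete id instead of the ? placeholder
GROUP  BY e.id;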