PostgreSQL Performance – Indices Not Used in FULL OUTER JOIN

indexjoin;performancepostgresqlpostgresql-performance

I have these tables I'd like to do a FULL OUTER JOIN between. They both have an index on (ts_minute, user_id, campaign_id, host_id) (which matches the join condition). In the following query, if I remove FULL OUTER from the query, it runs 10,000 times faster because it uses the indices (confirmed using EXPLAIN SELECT ...). Problem is, I need FULL OUTER.

select user_id, campaign_id, sum(i.cost), sum(b.bids_internal), sum(b.eligible_requests)
    from impr_summary_test i
    join bid_summary_test b USING (ts_minute, user_id, campaign_id, host_id)
    where ts_minute > now() - interval '5 min'
    group by 1, 2 order by 1,2;

There are indices on just (ts_minute) as well. It would be nice if it could use this index. The number of rows returned in the slow case is ~330, the number of rows returned in the fast case is ~220.

Postgres 9.3.

I've tried re-writing this as two separate sub queries, each with their own (identical) WHERE clause, and then JOIN'ing those results, and that does speed it up, but the basic INNER JOIN case is still 5 times faster than the re-write.

How can I speed this up by getting it to use the indices?

Number of rows in bid_summary_test: 8,292,673

Number of rows in bid_summary_test that match predicate ts_minute > now() - interval '5 min' : 28,849

Number of rows in impr_summary_test: 2,523,534

Number of rows in impr_summary_test that match predicate ts_minute > now() - interval '5 min' : 11,081

Best Answer

When combining two tables with FULL OUTER JOIN all rows are included in the result. Typically, it doesn't make sense to use indexes for the joining columns. That's why attempts with indexes on (ts_minute, user_id, campaign_id, host_id) will be fruitless. It just so happens that the first column also fits the relevant predicate on ts_minute. See below.

The query also returns substantially fewer rows with an [INNER] JOIN, which explains some of the performance difference - but certainly not factor 10.000. Your updated info confirms that fewer rows are returned (220 vs. 330) but that's still not the whole story. The difference could be much bigger before aggregation.

Your predicate on ts_minute is the better candidate for index support. There are two indices that would seem applicable for the predicate ts_minute > now() - interval '5 min' (pending information on actual data types). But the appearance is deceitful: The resulting column ts_minute after the FULL [OUTER] JOIN is a combination of the columns in both underlying tables. It's like an expression that's not sargable and cannot use indices on ts_minute on the underlying tables.

You have to apply the predicate before the FULL [OUTER] JOIN to allow index support. That's what you already stumbled upon in your added test with two separate sub queries. You'll find index (only) scans or probably bitmap index scans if you inspect the EXPLAIN output.

Since you GROUP BY user_id, campaign_id, it is irrelevant whether you add the columns ts_minute and host_id to the join condition in this query. ts_minute is only necessary for the WHERE condition.

SELECT user_id, campaign_id
     , i.sum_cost, b.sum_bids_internal, b.sum_eligible_requests
FROM  (
   SELECT user_id, campaign_id
        , sum(cost) AS sum_cost
   FROM   impr_summary_test
   WHERE  ts_minute > now() - interval '5 min'
   GROUP  BY 1, 2
   ) i 
FULL JOIN (
   SELECT user_id, campaign_id
        , sum(bids_internal) AS sum_bids_internal
        , sum(eligible_requests) AS sum_eligible_requests
   FROM   bid_summary_test
   WHERE  ts_minute > now() - interval '5 min'
   GROUP  BY 1, 2
   ) b USING (user_id, campaign_id)
ORDER  BY 1,2;

Assuming that only a very small percentage of the rows in each table lie within the "last 5 minutes", an index on ts_minute in both tables should make this very fast.

If your write patterns and autovacuum allow index-only scans, these indices would give you best performance:

CREATE INDEX ON impr_summary_test (ts_minute, user_id, campaign_id, cost);
CREATE INDEX ON bid_summary_test  (ts_minute, user_id, campaign_id, bids_internal, eligible_requests);

Weigh cost and benefit for highly specialized indexes.