I have these tables I'd like to do a FULL OUTER JOIN
between. They both have an index on (ts_minute, user_id, campaign_id, host_id)
(which matches the join condition). In the following query, if I remove FULL OUTER
from the query, it runs 10,000 times faster because it uses the indices (confirmed using EXPLAIN SELECT ...
). Problem is, I need FULL OUTER
.
select user_id, campaign_id, sum(i.cost), sum(b.bids_internal), sum(b.eligible_requests)
from impr_summary_test i
join bid_summary_test b USING (ts_minute, user_id, campaign_id, host_id)
where ts_minute > now() - interval '5 min'
group by 1, 2 order by 1,2;
There are indices on just (ts_minute)
as well. It would be nice if it could use this index. The number of rows returned in the slow case is ~330, the number of rows returned in the fast case is ~220.
Postgres 9.3.
I've tried re-writing this as two separate sub queries, each with their own (identical) WHERE
clause, and then JOIN
'ing those results, and that does speed it up, but the basic INNER JOIN
case is still 5 times faster than the re-write.
How can I speed this up by getting it to use the indices?
Number of rows in bid_summary_test: 8,292,673
Number of rows in bid_summary_test that match predicate ts_minute > now() - interval '5 min'
: 28,849
Number of rows in impr_summary_test: 2,523,534
Number of rows in impr_summary_test that match predicate ts_minute > now() - interval '5 min'
: 11,081
Best Answer
When combining two tables with
FULL OUTER JOIN
all rows are included in the result. Typically, it doesn't make sense to use indexes for the joining columns. That's why attempts with indexes on(ts_minute, user_id, campaign_id, host_id)
will be fruitless. It just so happens that the first column also fits the relevant predicate onts_minute
. See below.The query also returns substantially fewer rows with an
[INNER] JOIN
, which explains some of the performance difference - but certainly not factor 10.000. Your updated info confirms that fewer rows are returned (220 vs. 330) but that's still not the whole story. The difference could be much bigger before aggregation.Your predicate on
ts_minute
is the better candidate for index support. There are two indices that would seem applicable for the predicatets_minute > now() - interval '5 min'
(pending information on actual data types). But the appearance is deceitful: The resulting columnts_minute
after theFULL [OUTER] JOIN
is a combination of the columns in both underlying tables. It's like an expression that's not sargable and cannot use indices onts_minute
on the underlying tables.You have to apply the predicate before the
FULL [OUTER] JOIN
to allow index support. That's what you already stumbled upon in your added test with two separate sub queries. You'll find index (only) scans or probably bitmap index scans if you inspect theEXPLAIN
output.Since you
GROUP BY user_id, campaign_id
, it is irrelevant whether you add the columnsts_minute
andhost_id
to the join condition in this query.ts_minute
is only necessary for theWHERE
condition.Assuming that only a very small percentage of the rows in each table lie within the "last 5 minutes", an index on
ts_minute
in both tables should make this very fast.If your write patterns and
autovacuum
allow index-only scans, these indices would give you best performance:Weigh cost and benefit for highly specialized indexes.