I am experimenting with optimizing a query that's used to iterate over pages of rows with a joined table. After changing the query from OFFSET-based pagination to keyset pagination (WHERE id > last_id_from_prev_page with ORDER BY), as described here, the query performs well for the first pages, but as it gets closer to the middle pages, it performs poorly.
Here's some minimal DDL to represent the data I'm working with:
-- DDL:
create table sub ( -- 4.3M records
    subscription_id varchar(255) not null -- UUIDs
        constraint pk_sub primary key,
    subscription_status varchar(255) -- ACTIVE, PENDING, or CANCELLED
);

create table sub_item ( -- 4.4M records, mostly 1:1 with sub
    subscription_item_id varchar(255) not null -- UUIDs
        constraint pk_sub_item primary key,
    subscription_id varchar(255) -- FK to sub
        constraint fk_sub_item_sub_id references sub,
    item_ref varchar(255) -- 25-character strings, mostly the same value
);
create index idx_sub_item_sub_id_btree on sub_item (subscription_id);
And the query I'm using:
-- Query: Grab a page from the middle
EXPLAIN ANALYZE
SELECT s.subscription_id, *
FROM sub s
JOIN sub_item si ON s.subscription_id = si.subscription_id AND si.item_ref = '1001107599999910222001000'
WHERE s.subscription_status = 'ACTIVE'
AND s.SUBSCRIPTION_ID > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'
ORDER BY s.subscription_id
LIMIT 50;
By default, this query runs a Merge Join which performs poorly, often 10s+:
Limit (cost=2.41..325.76 rows=50 width=179) (actual time=10336.095..10340.389 rows=50 loops=1)
-> Merge Join (cost=2.41..765832.96 rows=118421 width=179) (actual time=10336.086..10339.900 rows=50 loops=1)
Merge Cond: ((s.subscription_id)::text = (si.subscription_id)::text)
-> Index Scan using pk_sub on sub s (cost=0.56..270648.21 rows=632913 width=45) (actual time=0.025..1.643 rows=90 loops=1)
Index Cond: ((subscription_id)::text > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'::text)
Filter: ((subscription_status)::text = 'ACTIVE'::text)
Rows Removed by Filter: 211
-> Index Scan using idx_sub_item_sub_id_btree on sub_item si (cost=0.56..490385.99 rows=813151 width=98) (actual time=0.016..8260.586 rows=393058 loops=1)
Filter: ((item_ref)::text = '1001107599999910222001000'::text)
Rows Removed by Filter: 1742475
Planning Time: 0.309 ms
Execution Time: 10340.691 ms
I'm not super experienced with reading these Postgres query plans, but it seems like the index scan on the sub_item table is where most of the time is spent. I found a couple of ways to circumvent this. One was to simply run SET enable_mergejoin = OFF;. Another was to use a HASH index on sub_item.subscription_id instead of the btree index. Both of those cause the planner to use a Nested Loop join, which performs significantly faster:
Limit (cost=1.11..446.97 rows=50 width=179) (actual time=0.150..4.351 rows=50 loops=1)
-> Nested Loop (cost=1.11..1055964.59 rows=118421 width=179) (actual time=0.140..3.850 rows=50 loops=1)
-> Index Scan using pk_sub on sub s (cost=0.56..270648.21 rows=632913 width=45) (actual time=0.064..0.898 rows=90 loops=1)
Index Cond: ((subscription_id)::text > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'::text)
Filter: ((subscription_status)::text = 'ACTIVE'::text)
Rows Removed by Filter: 211
-> Index Scan using idx_sub_item_sub_id_btree on sub_item si (cost=0.56..1.23 rows=1 width=98) (actual time=0.013..0.016 rows=1 loops=90)
Index Cond: ((subscription_id)::text = (s.subscription_id)::text)
Filter: ((item_ref)::text = '1001107599999910222001000'::text)
Rows Removed by Filter: 0
Planning Time: 0.196 ms
Execution Time: 4.623 ms
But it is not clear to me why the Merge Join is so slow. Based on my understanding, it is used when both inputs are sorted or have sorted access, which these do: via the sub PK and the sub_item.subscription_id btree index. The query without the join is almost instant, and direct lookups via sub_item.subscription_id are also almost instant.
So my two questions are:
- Why does the Merge Join perform so poorly with this type of query/pagination?
- Why does the Postgres query planner favor the much slower Merge Join in this case?
Best Answer
The planner doesn't understand how the restriction on s.subscription_id interacts with the merge join. The index scan on "s" gets to quickly skip about half the tuples by jumping to the middle of the index, while the index scan on "si" then will read through all the tuples that correspond to the quickly-skipped ones from table "s" before it starts finding any tuples that match to return (and so count against the LIMIT).
You could get the index scan on "si" to also quickly skip those tuples by replicating the subscription_id condition:
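A sketch of the rewritten query (the original answer's exact statement is not shown here, but per the description the keyset bound on s.subscription_id is duplicated against si.subscription_id inside the ON clause):

```sql
SELECT s.subscription_id, *
FROM sub s
JOIN sub_item si
  ON  s.subscription_id = si.subscription_id
  AND si.item_ref = '1001107599999910222001000'
  -- duplicated keyset bound: lets the index scan on sub_item start
  -- at the same point in the index as the scan on sub
  AND si.subscription_id > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'
WHERE s.subscription_status = 'ACTIVE'
  AND s.subscription_id > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'
ORDER BY s.subscription_id
LIMIT 50;
```

With this condition in place, the idx_sub_item_sub_id_btree scan can seek directly past the tuples the sub scan already skipped, instead of reading and discarding them.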
I put the extra test in the ON clause rather than the WHERE clause because, if you were to convert this to a LEFT JOIN, it would need to be in the ON. Since this is an inner JOIN it doesn't matter where you put it, so you might as well put it in the safer place to guard against future changes.
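To illustrate why placement matters for outer joins (a minimal sketch using the post's tables, not part of the original answer):

```sql
-- Condition in ON: sub rows with no matching sub_item are still returned,
-- with the si columns NULL-extended. Outer-join semantics are preserved.
SELECT s.subscription_id, si.item_ref
FROM sub s
LEFT JOIN sub_item si
  ON  s.subscription_id = si.subscription_id
  AND si.item_ref = '1001107599999910222001000';

-- Condition in WHERE: the filter is applied after the join, so the
-- NULL-extended rows fail the comparison and are removed -- the query
-- silently behaves like an inner join again.
SELECT s.subscription_id, si.item_ref
FROM sub s
LEFT JOIN sub_item si
  ON s.subscription_id = si.subscription_id
WHERE si.item_ref = '1001107599999910222001000';
```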
Often the planner is smart enough to realize that if a.x = b.x, then any test against a.x can automatically be applied to b.x as well. I don't understand the details of why it doesn't do that here.