PostgreSQL Query with Pagination is Slow When Using Merge Join – Performance Tips

join; paging; performance; postgresql; postgresql-performance

I am trying to optimize a query used to iterate over pages of rows with a joined table. After switching from OFFSET-based pagination to keyset pagination (WHERE id > last_id_from_prev_page with ORDER BY) as described here, the query performs well for the early pages, but performance degrades badly as it gets closer to the middle pages.
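For context, the change was roughly the following (the placeholder value and page size are illustrative, not my real data):

```sql
-- Before: OFFSET-based pagination; cost grows with the page number,
-- because Postgres must still walk past all the skipped rows
SELECT * FROM sub ORDER BY subscription_id LIMIT 50 OFFSET 100000;

-- After: keyset pagination; seeks directly to the page boundary
-- using the last id returned by the previous page
SELECT * FROM sub
WHERE subscription_id > 'last-id-from-previous-page'
ORDER BY subscription_id
LIMIT 50;
```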

Here's some minimal DDL to represent the data I'm working with:

-- DDL:
create table sub ( -- 4.3M records
    subscription_id     varchar(255) -- UUIDs
        not null constraint pk_sub primary key,
    subscription_status varchar(255) -- (ACTIVE, PENDING, or CANCELLED)
);

create table sub_item ( -- 4.4M records, mostly 1:1 with subscriptions
    subscription_item_id varchar(255) -- UUIDs
        not null constraint pk_sub_item primary key,
    subscription_id      varchar(255) -- FK to sub
        constraint fk_sub_item_sub_id references sub,
    item_ref             varchar(255) -- 25 character strings, mostly the same
);

create index idx_sub_item_sub_id_btree on sub_item (subscription_id);

And the query I'm using:

-- Query: Grab a page from the middle
EXPLAIN ANALYZE
SELECT s.subscription_id, *
FROM sub s
    JOIN sub_item si ON s.subscription_id = si.subscription_id AND si.item_ref = '1001107599999910222001000'
WHERE s.subscription_status = 'ACTIVE'
  AND s.SUBSCRIPTION_ID > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'
ORDER BY s.subscription_id
LIMIT 50;

By default, this query runs a Merge Join which performs poorly, often 10s+:

Limit  (cost=2.41..325.76 rows=50 width=179) (actual time=10336.095..10340.389 rows=50 loops=1)
  ->  Merge Join  (cost=2.41..765832.96 rows=118421 width=179) (actual time=10336.086..10339.900 rows=50 loops=1)
        Merge Cond: ((s.subscription_id)::text = (si.subscription_id)::text)
        ->  Index Scan using pk_sub on sub s  (cost=0.56..270648.21 rows=632913 width=45) (actual time=0.025..1.643 rows=90 loops=1)
              Index Cond: ((subscription_id)::text > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'::text)
              Filter: ((subscription_status)::text = 'ACTIVE'::text)
              Rows Removed by Filter: 211
        ->  Index Scan using idx_sub_item_sub_id_btree on sub_item si  (cost=0.56..490385.99 rows=813151 width=98) (actual time=0.016..8260.586 rows=393058 loops=1)
              Filter: ((item_ref)::text = '1001107599999910222001000'::text)
              Rows Removed by Filter: 1742475
Planning Time: 0.309 ms
Execution Time: 10340.691 ms

I'm not super experienced with reading these Postgres query plans, but it seems like the index scan on the sub_item table is where most of the time is spent. I found a couple of ways to circumvent this. One was simply to SET enable_mergejoin = OFF;. Another was to use a HASH index on sub_item.subscription_id instead of the B-tree. Both of those cause the planner to use a Nested Loop join, which performs significantly faster:
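For reference, the two workarounds expressed as SQL (the hash index name is my own invention):

```sql
-- Workaround 1: disable merge joins for the current session only
SET enable_mergejoin = OFF;

-- Workaround 2: use a hash index on the join column instead of the B-tree.
-- A hash index cannot provide sorted output, so the planner cannot
-- feed it into a Merge Join.
CREATE INDEX idx_sub_item_sub_id_hash ON sub_item USING hash (subscription_id);
DROP INDEX idx_sub_item_sub_id_btree;
```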

Limit  (cost=1.11..446.97 rows=50 width=179) (actual time=0.150..4.351 rows=50 loops=1)
  ->  Nested Loop  (cost=1.11..1055964.59 rows=118421 width=179) (actual time=0.140..3.850 rows=50 loops=1)
        ->  Index Scan using pk_sub on sub s  (cost=0.56..270648.21 rows=632913 width=45) (actual time=0.064..0.898 rows=90 loops=1)
              Index Cond: ((subscription_id)::text > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'::text)
              Filter: ((subscription_status)::text = 'ACTIVE'::text)
              Rows Removed by Filter: 211
        ->  Index Scan using idx_sub_item_sub_id_btree on sub_item si  (cost=0.56..1.23 rows=1 width=98) (actual time=0.013..0.016 rows=1 loops=90)
              Index Cond: ((subscription_id)::text = (s.subscription_id)::text)
              Filter: ((item_ref)::text = '1001107599999910222001000'::text)
              Rows Removed by Filter: 0
Planning Time: 0.196 ms
Execution Time: 4.623 ms

But it is not clear to me why the Merge Join is so slow. My understanding is that it is used when both inputs are sorted or have sorted access (which these do, via the sub primary key and the sub_item.subscription_id B-tree index). The query without the join is almost instant, and direct lookups via sub_item.subscription_id are also almost instant.
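For completeness, the fast standalone queries I'm comparing against look roughly like this:

```sql
-- Keyset pagination over sub alone: almost instant
SELECT s.subscription_id
FROM sub s
WHERE s.subscription_status = 'ACTIVE'
  AND s.subscription_id > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'
ORDER BY s.subscription_id
LIMIT 50;

-- Direct lookup in sub_item by subscription_id: almost instant
SELECT *
FROM sub_item
WHERE subscription_id = '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528';
```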

So my two questions are:

  1. Why does the Merge Join perform so poorly with this type of query/pagination?
  2. Why does the Postgres query planner favor the much slower Merge Join in this case?

Best Answer

The planner doesn't understand how the restriction on s.subscription_id interacts with the merge join. The index scan on "s" can quickly skip about half the table's tuples by jumping to the middle of its index, but the index scan on "si" must still read through all the tuples corresponding to the ones "s" skipped before it starts finding any tuples that match (and so count against the LIMIT).

You could get the index scan on "si" to also quickly skip those tuples by replicating the subscription_id condition:

SELECT s.subscription_id, *
FROM sub s
    JOIN sub_item si ON s.subscription_id = si.subscription_id AND si.item_ref = '1001107599999910222001000' AND
                         si.subscription_id > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'
WHERE s.subscription_status = 'ACTIVE'
  AND s.SUBSCRIPTION_ID > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528'
ORDER BY s.subscription_id
LIMIT 50;

I put the extra test in the ON clause rather than the WHERE clause because, if you were ever to convert this to a LEFT JOIN, it would need to be in the ON. Since this is an inner join, it doesn't matter where you put it, so you might as well put it in the safer place to guard against future changes.

Often the planner is smart enough to realize that if a.x = b.x, then any test against a.x can automatically be applied to b.x as well. That derivation only happens for equality conditions, though; an inequality like this one is not propagated across the join, which is why you have to duplicate the condition yourself.
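One way to see the distinction, assuming the same schema: with an equality condition on the join key, the planner derives the matching restriction on the other side of the join, but with an inequality it restricts only the table the condition is written against.

```sql
-- Equality on the join key: the planner applies the constant
-- restriction to both sub and sub_item
EXPLAIN
SELECT * FROM sub s JOIN sub_item si USING (subscription_id)
WHERE s.subscription_id = '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528';

-- Inequality on the join key: only sub is restricted; sub_item is not,
-- which is why the query above duplicates the condition by hand
EXPLAIN
SELECT * FROM sub s JOIN sub_item si USING (subscription_id)
WHERE s.subscription_id > '7ca1cf6b-2452-4d1b-bd03-1ba68b63b528';
```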