Postgresql – Avoid duplicate WHERE clause on both sides of a LEFT JOIN, without changing semantics or impairing query optimization

optimizationpostgresqlquery-performance

I have a table recording the results of a web crawl. (Table
definition at the end of this question.) The relevant part of the data
stored in it looks like this:

 experiment_id | url_id | redirect_num | full_url_id | has_blockpage_regex | blockpage_reason_id
---------------+--------+--------------+-------------+---------------------+---------------------
            16 |   1312 |            0 |        1312 |                   f |
            16 |   1312 |            1 |        2311 |                   f |
            16 |   1312 |            2 |        2312 |                   f |
            16 |   1312 |            3 |        2313 |                   f |
            43 |   1320 |            0 |        1320 |                   f |
            43 |   1320 |            2 |        2312 |                   f |
            43 |   1320 |            1 |        2317 |                   t |                   1
            43 |   1320 |            3 |        2318 |                   f |

For each (experiment_id, url_id) pair, from a small set of experiment IDs, I want to query the full_url_id associated with the largest value of redirect_num, and the blockpage_reason_id associated with the smallest value of redirect_num for which has_blockpage_regex is true. (If there is no row satisfying the latter condition, blockpage_reason_id should come out null.) For the data above, the desired results would be like this:

 experiment_id | url_id | full_url_id | blockpage_reason_id
---------------+--------+-------------+---------------------
            16 |   1312 |        2313 |
            43 |   1320 |        2318 |                   1

I have a query that does what I want, but it's extremely slow, involving multiple full table scans.
(EXPLAIN ANALYZE dump also at the end of the question.)

select x.experiment_id, x.url_id, x.full_url_id, y.blockpage_reason_id from (
  select bm.experiment_id, bm.url_id, b.full_url_id
    from blockpage b
    join (select experiment_id, url_id, max(redirect_num) as redirect_num
            from blockpage
           group by experiment_id, url_id) bm
      on b.experiment_id = bm.experiment_id
     and b.url_id = bm.url_id
     and b.redirect_num = bm.redirect_num
) x left join (
  select bm.experiment_id, bm.url_id,
         b.has_blockpage_regex, b.blockpage_reason_id
    from blockpage b
    join (select experiment_id, url_id, max(redirect_num) as redirect_num
            from blockpage
           where has_blockpage_regex
           group by experiment_id, url_id) bm
      on b.experiment_id = bm.experiment_id
     and b.url_id = bm.url_id
     and b.redirect_num = bm.redirect_num
) y on x.experiment_id = y.experiment_id
   and x.url_id = y.url_id
 where x.experiment_id in (16,43);

I can make it work efficiently by duplicating the WHERE clause at the very end into the subselects on both sides of the LEFT JOIN:

select x.experiment_id, x.url_id, x.full_url_id, y.blockpage_reason_id from (
  select bm.experiment_id, bm.url_id, b.full_url_id
    from blockpage b
    join (select experiment_id, url_id, max(redirect_num) as redirect_num
            from blockpage
           group by experiment_id, url_id) bm
      on b.experiment_id = bm.experiment_id
     and b.url_id = bm.url_id
     and b.redirect_num = bm.redirect_num
   where bm.experiment_id in (16,43)         -- DUPLICATED
) x left join (
  select bm.experiment_id, bm.url_id,
         b.has_blockpage_regex, b.blockpage_reason_id
    from blockpage b
    join (select experiment_id, url_id, max(redirect_num) as redirect_num
            from blockpage
           where has_blockpage_regex
           group by experiment_id, url_id) bm
      on b.experiment_id = bm.experiment_id
     and b.url_id = bm.url_id
     and b.redirect_num = bm.redirect_num
   where bm.experiment_id in (16,43)         -- DUPLICATED
) y on x.experiment_id = y.experiment_id
   and x.url_id = y.url_id;

But this will not work in practice, because the query (without the WHERE) is going to be used as a view definition, and people will select with WHEREs from the view, which is equivalent to applying a single WHERE to the outermost SELECT.

So my question is, how do I rewrite this query so that WHEREs on the outermost SELECT will be executed as efficiently as they are if I push them inside the LEFT JOIN manually? Change it as much as you want, as long as it still produces the same results as the example at the top.

Database is Postgres 12.2.


EXPLAIN ANALYZE for the original query:

 Nested Loop Left Join  (cost=1315730.90..3856617.09 rows=20 width=16) (actual time=36112.365..55406.053 rows=501 loops=1)
   Output: blockpage.experiment_id, blockpage.url_id, b.full_url_id, b_1.blockpage_reason_id
   Join Filter: ((blockpage.experiment_id = blockpage_1.experiment_id) AND (blockpage.url_id = blockpage_1.url_id))
   Rows Removed by Join Filter: 220714047
   ->  Nested Loop  (cost=1.14..88505.96 rows=20 width=12) (actual time=466.302..480.840 rows=501 loops=1)
         Output: b.full_url_id, blockpage.experiment_id, blockpage.url_id
         ->  GroupAggregate  (cost=0.57..15442.73 rows=8543 width=12) (actual time=466.267..470.146 rows=501 loops=1)
               Output: blockpage.experiment_id, blockpage.url_id, max(blockpage.redirect_num)
               Group Key: blockpage.experiment_id, blockpage.url_id
               ->  Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage  (cost=0.57..15293.19 rows=8547 width=12) (actual time=0.048..1.662 rows=803 loops=1)
                     Output: blockpage.experiment_id, blockpage.url_id, blockpage.full_url_id, blockpage.redirect_num, blockpage.html_tag_id
                     Index Cond: (blockpage.experiment_id = ANY ('{16,43}'::integer[]))
                     Heap Fetches: 803
         ->  Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage b  (cost=0.57..8.53 rows=1 width=16) (actual time=0.016..0.018 rows=1 loops=501)
               Output: b.experiment_id, b.url_id, b.full_url_id, b.redirect_num, b.html_tag_id
               Index Cond: ((b.experiment_id = blockpage.experiment_id) AND (b.url_id = blockpage.url_id) AND (b.redirect_num = (max(blockpage.redirect_num))))
               Heap Fetches: 501
   ->  Materialize  (cost=1315729.77..3767691.70 rows=1207 width=12) (actual time=7.067..89.428 rows=440547 loops=501)
         Output: b_1.blockpage_reason_id, blockpage_1.experiment_id, blockpage_1.url_id
         ->  Hash Join  (cost=1315729.77..3767685.66 rows=1207 width=12) (actual time=3540.653..35478.463 rows=440547 loops=1)
               Output: b_1.blockpage_reason_id, blockpage_1.experiment_id, blockpage_1.url_id
               Inner Unique: true
               Hash Cond: ((b_1.experiment_id = blockpage_1.experiment_id) AND (b_1.url_id = blockpage_1.url_id) AND (b_1.redirect_num = (max(blockpage_1.redirect_num))))
               ->  Seq Scan on iclab.blockpage b_1  (cost=0.00..1757677.88 rows=88162288 width=16) (actual time=0.012..9535.931 rows=88164599 loops=1)
                     Output: b_1.id, b_1.url_id, b_1.full_url_id, b_1.experiment_id, b_1.blockpage_reason_id, b_1.html_tag_id, b_1.body_len, b_1.blockpage_diff, b_1.has_blockpage_regex, b_1.http_status, b_1.redirect_num, b_1.response_failure
               ->  Hash  (cost=1306621.31..1306621.31 rows=520483 width=12) (actual time=3538.005..3538.005 rows=440547 loops=1)
                     Output: blockpage_1.experiment_id, blockpage_1.url_id, (max(blockpage_1.redirect_num))
                     Buckets: 524288  Batches: 1  Memory Usage: 23026kB
                     ->  Finalize HashAggregate  (cost=1296211.65..1301416.48 rows=520483 width=12) (actual time=3325.126..3427.920 rows=440547 loops=1)
                           Output: blockpage_1.experiment_id, blockpage_1.url_id, max(blockpage_1.redirect_num)
                           Group Key: blockpage_1.experiment_id, blockpage_1.url_id
                           ->  Gather  (cost=1246069.28..1292868.83 rows=445710 width=12) (actual time=3029.657..3143.505 rows=441870 loops=1)
                                 Output: blockpage_1.experiment_id, blockpage_1.url_id, (PARTIAL max(blockpage_1.redirect_num))
                                 Workers Planned: 2
                                 Workers Launched: 2
                                 JIT for worker 0:
                                   Functions: 9
                                   Options: Inlining true, Optimization true, Expressions true, Deforming true
                                   Timing: Generation 1.136 ms, Inlining 51.049 ms, Optimization 73.089 ms, Emission 39.722 ms, Total 164.995 ms
                                 JIT for worker 1:
                                   Functions: 9
                                   Options: Inlining true, Optimization true, Expressions true, Deforming true
                                   Timing: Generation 2.075 ms, Inlining 72.675 ms, Optimization 70.854 ms, Emission 39.490 ms, Total 185.093 ms
                                 ->  Partial HashAggregate  (cost=1245069.28..1247297.83 rows=222855 width=12) (actual time=3004.908..3047.674 rows=147290 loops=3)
                                       Output: blockpage_1.experiment_id, blockpage_1.url_id, PARTIAL max(blockpage_1.redirect_num)
                                       Group Key: blockpage_1.experiment_id, blockpage_1.url_id
                                       Worker 0: actual time=2999.661..3042.358 rows=137916 loops=1
                                       Worker 1: actual time=2985.698..3026.195 rows=144767 loops=1
                                       ->  Parallel Seq Scan on iclab.blockpage blockpage_1  (cost=0.00..1243397.87 rows=222855 width=12) (actual time=115.822..2924.688 rows=177783 loops=3)
                                             Output: blockpage_1.id, blockpage_1.url_id, blockpage_1.full_url_id, blockpage_1.experiment_id, blockpage_1.blockpage_reason_id, blockpage_1.html_tag_id, blockpage_1.body_len, blockpage_1.blockpage_diff, blockpage_1.has_blockpage_regex, blockpage_1.http_status, blockpage_1.redirect_num, blockpage_1.response_failure
                                             Filter: blockpage_1.has_blockpage_regex
                                             Rows Removed by Filter: 29210417
                                             Worker 0: actual time=164.165..2922.106 rows=165851 loops=1
                                             Worker 1: actual time=183.286..2912.232 rows=173763 loops=1
 Planning Time: 1.013 ms
 JIT:
   Functions: 60
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 11.387 ms, Inlining 132.486 ms, Optimization 415.860 ms, Emission 264.166 ms, Total 823.898 ms
 Execution Time: 55458.769 ms

EXPLAIN ANALYZE for the query with duplicate WHEREs:

 Merge Left Join  (cost=2.27..104247.90 rows=20 width=16) (actual time=40.938..44.044 rows=501 loops=1)
   Output: blockpage.experiment_id, blockpage.url_id, b.full_url_id, b_1.blockpage_reason_id
   Merge Cond: ((blockpage.experiment_id = blockpage_1.experiment_id) AND (blockpage.url_id = blockpage_1.url_id))
   ->  Nested Loop  (cost=1.14..88505.96 rows=20 width=12) (actual time=40.603..43.615 rows=501 loops=1)
         Output: b.full_url_id, blockpage.experiment_id, blockpage.url_id
         ->  GroupAggregate  (cost=0.57..15442.73 rows=8543 width=12) (actual time=40.561..41.298 rows=501 loops=1)
               Output: blockpage.experiment_id, blockpage.url_id, max(blockpage.redirect_num)
               Group Key: blockpage.experiment_id, blockpage.url_id
               ->  Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage  (cost=0.57..15293.19 rows=8547 width=12) (actual time=0.045..0.432 rows=803 loops=1)
                     Output: blockpage.experiment_id, blockpage.url_id, blockpage.full_url_id, blockpage.redirect_num, blockpage.html_tag_id
                     Index Cond: (blockpage.experiment_id = ANY ('{16,43}'::integer[]))
                     Heap Fetches: 803
         ->  Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage b  (cost=0.57..8.53 rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=501)
               Output: b.experiment_id, b.url_id, b.full_url_id, b.redirect_num, b.html_tag_id
               Index Cond: ((b.experiment_id = blockpage.experiment_id) AND (b.url_id = blockpage.url_id) AND (b.redirect_num = (max(blockpage.redirect_num))))
               Heap Fetches: 501
   ->  Materialize  (cost=1.14..15741.83 rows=1 width=12) (actual time=0.329..0.329 rows=0 loops=1)
         Output: b_1.blockpage_reason_id, blockpage_1.experiment_id, blockpage_1.url_id
         ->  Nested Loop  (cost=1.14..15741.82 rows=1 width=12) (actual time=0.326..0.326 rows=0 loops=1)
               Output: b_1.blockpage_reason_id, blockpage_1.experiment_id, blockpage_1.url_id
               ->  GroupAggregate  (cost=0.57..15294.10 rows=52 width=12) (actual time=0.325..0.325 rows=0 loops=1)
                     Output: blockpage_1.experiment_id, blockpage_1.url_id, max(blockpage_1.redirect_num)
                     Group Key: blockpage_1.experiment_id, blockpage_1.url_id
                     ->  Index Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage blockpage_1  (cost=0.57..15293.19 rows=52 width=12) (actual time=0.324..0.325 rows=0 loops=1)
                           Output: blockpage_1.id, blockpage_1.url_id, blockpage_1.full_url_id, blockpage_1.experiment_id, blockpage_1.blockpage_reason_id, blockpage_1.html_tag_id, blockpage_1.body_len, blockpage_1.blockpage_diff, blockpage_1.has_blockpage_regex, blockpage_1.http_status, blockpage_1.redirect_num, blockpage_1.response_failure
                           Index Cond: (blockpage_1.experiment_id = ANY ('{16,43}'::integer[]))
                           Filter: blockpage_1.has_blockpage_regex
                           Rows Removed by Filter: 803
               ->  Index Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage b_1  (cost=0.57..8.59 rows=1 width=16) (never executed)
                     Output: b_1.id, b_1.url_id, b_1.full_url_id, b_1.experiment_id, b_1.blockpage_reason_id, b_1.html_tag_id, b_1.body_len, b_1.blockpage_diff, b_1.has_blockpage_regex, b_1.http_status, b_1.redirect_num, b_1.response_failure
                     Index Cond: ((b_1.experiment_id = blockpage_1.experiment_id) AND (b_1.url_id = blockpage_1.url_id) AND (b_1.redirect_num = (max(blockpage_1.redirect_num))))
 Planning Time: 1.004 ms
 JIT:
   Functions: 33
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 6.438 ms, Inlining 0.000 ms, Optimization 1.856 ms, Emission 38.146 ms, Total 46.440 ms
 Execution Time: 50.745 ms

Table definition:

CREATE TABLE iclab.blockpage (
    id                          BIGSERIAL   PRIMARY KEY,
    url_id                      INTEGER     NOT NULL,
    full_url_id                 INTEGER     NOT NULL,
    experiment_id               INTEGER     NOT NULL,
    blockpage_reason_id         INTEGER,
    html_tag_id                 INTEGER     NOT NULL,
    body_len                    INTEGER,
    blockpage_diff              REAL,
    has_blockpage_regex         BOOLEAN,
    http_status                 INTEGER,
    redirect_num                INTEGER     NOT NULL,
    response_failure            INTEGER,
    FOREIGN KEY (experiment_id)         REFERENCES iclab.experiment_results(id),
    FOREIGN KEY(url_id)                 REFERENCES iclab.URL(id),
    FOREIGN KEY(url_id)                 REFERENCES iclab.URL(id),
    FOREIGN KEY(blockpage_reason_id)    REFERENCES iclab.blockpage_reason(id),
    FOREIGN KEY(html_tag_id)            REFERENCES iclab.html_tag(id)
);

CREATE UNIQUE INDEX blockpage_experiment_id_url_id_redirect_num_blockpage_reason__
    ON iclab.blockpage(experiment_id , url_id, full_url_id, redirect_num, html_tag_id);

Best Answer

I have found a way to do this efficiently, using lateral joins instead of ordinary subqueries. As a bonus, the SQL is also easier to read.

SELECT b.experiment_id, b.url_id, x.full_url_id, y.blockpage_reason_id
  FROM blockpage b
  JOIN LATERAL (
    SELECT experiment_id, url_id, full_url_id
      FROM blockpage
     WHERE experiment_id = b.experiment_id AND url_id = b.url_id
     ORDER BY redirect_num DESC
     LIMIT 1
  ) x ON TRUE
  LEFT JOIN LATERAL (
    SELECT experiment_id, url_id, blockpage_reason_id
      FROM blockpage
     WHERE experiment_id = b.experiment_id AND url_id = b.url_id AND has_blockpage_regex
     ORDER BY redirect_num ASC
     LIMIT 1
  ) y ON TRUE
 WHERE b.experiment_id IN (16, 43);

The documentation describes "LATERAL" as letting you refer, in a subquery expression, to tables from earlier entries in the FROM-list, and that's true, but saying it that way doesn't really clue one in as to why one might want that. Its real power is that it directs the database to run a subquery separately for each row of the virtual table-so-far, as defined by the parent query. In this case, that means that the first inner query

    SELECT experiment_id, url_id, full_url_id
      FROM blockpage
     WHERE experiment_id = b.experiment_id AND url_id = b.url_id
     ORDER BY redirect_num DESC
     LIMIT 1

will, for each value of b.experiment_id and b.url_id, pull a set of rows from blockpage, sort them, and take the first. This is exactly what I was trying to achieve with the "join (select max(redirect_num) from blockpage ...)" construct in the original query, but the planner handles it much more efficiently. It winds up being even faster than the version with the WHERE duplicated.

 Nested Loop Left Join  (cost=17.76..162686.21 rows=8547 width=20) (actual time=17.116..28.811 rows=803 loops=1)
   Output: b.experiment_id, b.url_id, blockpage.full_url_id, blockpage.redirect_num, blockpage_1.blockpage_reason_id
   ->  Nested Loop  (cost=9.17..88989.70 rows=8547 width=16) (actual time=17.096..23.644 rows=803 loops=1)
         Output: b.experiment_id, b.url_id, blockpage.full_url_id, blockpage.redirect_num
         ->  Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage b  (cost=0.57..15293.19 rows=8547 width=8) (actual time=0.044..0.523 rows=803 loops=1)
               Output: b.experiment_id, b.url_id, b.full_url_id, b.redirect_num, b.html_tag_id
               Index Cond: (b.experiment_id = ANY ('{16,43}'::integer[]))
               Heap Fetches: 803
         ->  Limit  (cost=8.60..8.60 rows=1 width=16) (actual time=0.007..0.007 rows=1 loops=803)
               Output: NULL::integer, NULL::integer, blockpage.full_url_id, blockpage.redirect_num
               ->  Sort  (cost=8.60..8.60 rows=1 width=16) (actual time=0.006..0.006 rows=1 loops=803)
                     Output: NULL::integer, NULL::integer, blockpage.full_url_id, blockpage.redirect_num
                     Sort Key: blockpage.redirect_num DESC
                     Sort Method: quicksort  Memory: 25kB
                     ->  Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage  (cost=0.57..8.59 rows=1 width=16) (actual time=0.004..0.005 rows=2 loops=803)
                           Output: NULL::integer, NULL::integer, blockpage.full_url_id, blockpage.redirect_num
                           Index Cond: ((blockpage.experiment_id = b.experiment_id) AND (blockpage.url_id = b.url_id))
                           Heap Fetches: 1565
   ->  Limit  (cost=8.60..8.60 rows=1 width=16) (actual time=0.006..0.006 rows=0 loops=803)
         Output: NULL::integer, NULL::integer, blockpage_1.blockpage_reason_id, blockpage_1.redirect_num
         ->  Sort  (cost=8.60..8.60 rows=1 width=16) (actual time=0.005..0.005 rows=0 loops=803)
               Output: NULL::integer, NULL::integer, blockpage_1.blockpage_reason_id, blockpage_1.redirect_num
               Sort Key: blockpage_1.redirect_num
               Sort Method: quicksort  Memory: 25kB
               ->  Index Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage blockpage_1  (cost=0.57..8.59 rows=1 width=16) (actual time=0.004..0.004 rows=0 loops=803)
                     Output: NULL::integer, NULL::integer, blockpage_1.blockpage_reason_id, blockpage_1.redirect_num
                     Index Cond: ((blockpage_1.experiment_id = b.experiment_id) AND (blockpage_1.url_id = b.url_id))
                     Filter: blockpage_1.has_blockpage_regex
                     Rows Removed by Filter: 2
 Planning Time: 0.382 ms
 JIT:
   Functions: 19
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 3.292 ms, Inlining 0.000 ms, Optimization 0.797 ms, Emission 15.897 ms, Total 19.986 ms
 Execution Time: 32.324 ms