PostgreSQL – Why are correlated subqueries sometimes faster than joins in Postgres

join; postgresql; subquery

This relates to the dataset described in "Postgres is performing sequential scan instead of index scan".

I've started working on adapting the import logic to work with a more normalised schema – no surprises here, it's faster and more compact – but I've hit a roadblock updating the existing data: adding and updating the relevant foreign keys is taking an age.

UPDATE pages
SET id_site = sites.id
FROM sites
WHERE sites.url = pages."urlShort"
    AND pages."labelDate" = '2015-01-15'

NB pages."urlShort" and sites.url are text fields, both are indexed but currently have no explicit relationship.
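For reference, the indexes were created along these lines (the definitions are approximate and the names are placeholders, apart from "pages_labelDate_idx", which appears in the plan below):

-- roughly how the existing indexes were defined; plain b-trees, no foreign key
CREATE INDEX "pages_labelDate_idx" ON pages ("labelDate");
CREATE INDEX ON pages ("urlShort"); -- name auto-generated by Postgres
CREATE INDEX ON sites (url);        -- name auto-generated by Postgres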

There are around 500,000 rows for each date value and updates like this are taking around 2.5 hours. 🙁

I looked at what the underlying query might look like:

select * 
from pages
join sites on
sites.url = pages."urlShort"
where "labelDate" = '2015-01-01'

This takes around 6 minutes to run and has a query plan like this:

Hash Join  (cost=80226.81..934763.02 rows=493018 width=365)
  Hash Cond: ((pages."urlShort")::text = sites.url)
  ->  Bitmap Heap Scan on pages  (cost=13549.32..803595.26 rows=493018 width=315)
        Recheck Cond: ("labelDate" = '2015-01-01'::date)
        ->  Bitmap Index Scan on "pages_labelDate_idx"  (cost=0.00..13426.07 rows=493018 width=0)
              Index Cond: ("labelDate" = '2015-01-01'::date)
  ->  Hash  (cost=30907.66..30907.66 rows=1606466 width=50)
        ->  Seq Scan on sites  (cost=0.00..30907.66 rows=1606466 width=50)

There are two things I'd like to know:

  1. Can I adjust the update to run faster based on the above?
  2. What parts of a query plan are telltale signs of a slow query? Or do you always have to run EXPLAIN ANALYZE to find out?

Here's the output from running EXPLAIN ANALYZE during the update:

Update on pages  (cost=65108.49..1545071.72 rows=4037902 width=331) (actual time=48181951.584..48181951.584 rows=0 loops=1)
  ->  Hash Join  (cost=65108.49..1545071.72 rows=4037902 width=331) (actual time=4075.973..1200902.835 rows=1909499 loops=1)
        Hash Cond: ((pages."urlShort")::text = sites.url)
        ->  Seq Scan on pages  (cost=0.00..1056057.95 rows=4037902 width=317) (actual time=0.025..456909.895 rows=2053904 loops=1)
              Filter: ((id_site IS NULL) AND ("labelDate" < '2015-09-01'::date))
              Rows Removed by Filter: 12105346
        ->  Hash  (cost=30907.66..30907.66 rows=1606466 width=41) (actual time=4061.106..4061.106 rows=1606489 loops=1)
              Buckets: 2097152  Batches: 2  Memory Usage: 74179kB
              ->  Seq Scan on sites  (cost=0.00..30907.66 rows=1606466 width=41) (actual time=0.024..869.068 rows=1606489 loops=1)
Planning time: 3.767 ms
Execution time: 48181966.394 ms

I've tried different memory settings (from 128MB to 1024MB shared_buffers) but they don't seem to make much difference.
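One thing I notice re-reading the plan above: the hash node reports Batches: 2 with Memory Usage: 74179kB, which I understand to mean the hash table spilled to disk because it exceeded work_mem; shared_buffers doesn't control that. If that's right, something like this might be worth trying (the value is an arbitrary guess):

-- raise the per-operation sort/hash memory for this session only,
-- then re-run EXPLAIN ANALYZE to see whether Batches drops to 1
SET work_mem = '256MB';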

Best Answer

To start, a correlated subquery used as an expression can't return more than one row. Take this example:

SELECT t.x
FROM ( VALUES (1),(2) )
  AS t(x)
LEFT OUTER JOIN (
  VALUES (1),(1)
) AS y(x)
USING (x);
 x 
---
 1
 1
 2
(3 rows)

Versus the same query written as a correlated subquery:

SELECT t.x, (
  SELECT *
  FROM ( VALUES (1),(1) )
    AS y(x)
  WHERE y.x = t.x
)
FROM ( VALUES (1),(2) )
  AS t(x);
ERROR:  more than one row returned by a subquery used as an expression

That tells the planner something useful, and it can optimize for it. Adjusting for this by reducing the inner table to one row, you'll see the plans:

EXPLAIN ANALYZE SELECT t.x
FROM ( VALUES (1),(2) )
  AS t(x)
LEFT OUTER JOIN (
  VALUES (1),(1)
) AS y(x)
USING (x);

                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=0.05..0.10 rows=2 width=4) (actual time=0.019..0.021 rows=3 loops=1)
   Hash Cond: ("*VALUES*".column1 = "*VALUES*_1".column1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=4) (actual time=0.002..0.003 rows=2 loops=1)
   ->  Hash  (cost=0.03..0.03 rows=2 width=4) (actual time=0.005..0.005 rows=2 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Values Scan on "*VALUES*_1"  (cost=0.00..0.03 rows=2 width=4) (actual time=0.001..0.003 rows=2 loops=1)
 Planning time: 0.107 ms
 Execution time: 0.046 ms
(8 rows)

And the correlated subquery method:

EXPLAIN ANALYZE SELECT t.x, (
  SELECT *
  FROM ( VALUES (1) )
    AS y(x)
  WHERE y.x = t.x
)        
FROM ( VALUES (1),(2) )
  AS t(x);


                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Values Scan on "*VALUES*"  (cost=0.00..0.05 rows=2 width=4) (actual time=0.014..0.017 rows=2 loops=1)
   SubPlan 1
     ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.003..0.004 rows=0 loops=2)
           One-Time Filter: (1 = "*VALUES*".column1)
 Planning time: 0.097 ms
 Execution time: 0.043 ms
(6 rows)

This vaguely follows your examples above. So when is a correlated subquery better than a join? I'm not sure, but the correlated subquery gets to skip the hash join.
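If you want to try that shape on your update, a correlated subquery version would look something like this. It's a sketch that assumes sites.url is unique; otherwise the subquery raises the more-than-one-row error shown above.

-- caveat: rows with no matching site get id_site set to NULL by this form
UPDATE pages
SET id_site = (
    SELECT sites.id
    FROM sites
    WHERE sites.url = pages."urlShort"
)
WHERE pages."labelDate" = '2015-01-15';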

You have this in your plan:

Filter: ((id_site IS NULL) AND ("labelDate" < '2015-09-01'::date))

That's not the same SQL as the update you posted above, which makes me question the whole thing.

I would do this, though:

CREATE INDEX ON pages ("labelDate") WHERE id_site IS NULL;

If you don't have an index covering id_site IS NULL (or id_site itself), it seems likely you're just forcing a seq scan there.
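To check whether the planner actually picks it up, run EXPLAIN with the same filter your plan showed; a sketch:

-- with the partial index in place, the Seq Scan on pages should
-- be replaced by a (bitmap) index scan on the new index
EXPLAIN
UPDATE pages
SET id_site = sites.id
FROM sites
WHERE sites.url = pages."urlShort"
    AND pages.id_site IS NULL
    AND pages."labelDate" < '2015-09-01';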