This relates to the dataset described in "Postgres is performing sequential scan instead of index scan".
I've started adapting the import logic to work with a more normalised schema (no surprises here: it's faster and more compact), but I've hit a roadblock updating the existing data: adding and updating the relevant foreign keys is taking an age.
UPDATE pages
SET id_site = sites.id
FROM sites
WHERE sites.url = pages."urlShort"
AND pages."labelDate" = '2015-01-15'
NB pages."urlShort" and sites.url are text fields; both are indexed but currently have no explicit relationship.
There are around 500,000 rows for each date value and updates like this are taking around 2.5 hours. 🙁
I looked at what the underlying query might look like:
select *
from pages
join sites on
sites.url = pages."urlShort"
where "labelDate" = '2015-01-01'
This takes around 6 minutes to run and has a query plan like this:
Hash Join (cost=80226.81..934763.02 rows=493018 width=365)
Hash Cond: ((pages."urlShort")::text = sites.url)
-> Bitmap Heap Scan on pages (cost=13549.32..803595.26 rows=493018 width=315)
Recheck Cond: ("labelDate" = '2015-01-01'::date)
-> Bitmap Index Scan on "pages_labelDate_idx" (cost=0.00..13426.07 rows=493018 width=0)
Index Cond: ("labelDate" = '2015-01-01'::date)
-> Hash (cost=30907.66..30907.66 rows=1606466 width=50)
-> Seq Scan on sites (cost=0.00..30907.66 rows=1606466 width=50)
There are two things I'd like to know:
- Can I adjust the update to run faster based on the above?
- What parts of the query plan are telltales for running slow? Or do you always have to run EXPLAIN ANALYZE to find that out?
Here's the output from running EXPLAIN ANALYZE during the update:
Update on pages (cost=65108.49..1545071.72 rows=4037902 width=331) (actual time=48181951.584..48181951.584 rows=0 loops=1)
-> Hash Join (cost=65108.49..1545071.72 rows=4037902 width=331) (actual time=4075.973..1200902.835 rows=1909499 loops=1)
Hash Cond: ((pages."urlShort")::text = sites.url)
-> Seq Scan on pages (cost=0.00..1056057.95 rows=4037902 width=317) (actual time=0.025..456909.895 rows=2053904 loops=1)
Filter: ((id_site IS NULL) AND ("labelDate" < '2015-09-01'::date))
Rows Removed by Filter: 12105346
-> Hash (cost=30907.66..30907.66 rows=1606466 width=41) (actual time=4061.106..4061.106 rows=1606489 loops=1)
Buckets: 2097152 Batches: 2 Memory Usage: 74179kB
-> Seq Scan on sites (cost=0.00..30907.66 rows=1606466 width=41) (actual time=0.024..869.068 rows=1606489 loops=1)
Planning time: 3.767 ms
Execution time: 48181966.394 ms
I've tried different memory settings (from 128MB to 1024MB shared_buffers) but they don't seem to make much difference.
Best Answer
To start, a correlated subquery can't return more than one row. That tells the planner something useful, and it can optimize for it: because the inner side is reduced to a single row per outer row, the plans come out differently. This follows vaguely your examples above. So when is a correlated subquery better than a join? Not sure, but the correlated subquery gets to skip the hash join.
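The example blocks this paragraph originally referred to didn't survive here, but the idea can be sketched. Assuming the tables from the question (pages.id_site, sites.id, sites.url, pages."urlShort"), the join-style update versus the correlated-subquery form would look roughly like:

```sql
-- Join form: the planner builds a hash of sites and joins it to pages.
UPDATE pages
SET id_site = sites.id
FROM sites
WHERE sites.url = pages."urlShort"
  AND pages."labelDate" = '2015-01-15';

-- Correlated-subquery form: the subquery can return at most one row per
-- page, which the planner can exploit (typically an index lookup on
-- sites.url for each outer row instead of a hash join).
UPDATE pages
SET id_site = (SELECT sites.id
               FROM sites
               WHERE sites.url = pages."urlShort")
WHERE pages."labelDate" = '2015-01-15';
```

One behavioural difference worth noting: the subquery form sets id_site to NULL when no matching site exists, whereas the join form leaves such rows untouched.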
You have this in your plan:

Filter: ((id_site IS NULL) AND ("labelDate" < '2015-09-01'::date))

That's not even the same SQL you've got above, which makes me call the whole thing into question.
One thing I would do, though: if you don't have an index on id_site (or a partial index covering id_site IS NULL), then it seems likely you're just forcing a seq scan there.
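The exact index definition the answer originally showed is missing here, but as a sketch, a partial index targeting the rows the update actually touches might look like this (index name is hypothetical):

```sql
-- Hypothetical partial index: covers only the not-yet-linked rows, so a
-- filter like (id_site IS NULL AND "labelDate" = ...) can use it instead
-- of a seq scan over the whole table.
CREATE INDEX pages_unlinked_labeldate_idx
    ON pages ("labelDate")
    WHERE id_site IS NULL;
```

A nice property of the partial form is that it shrinks as the backfill progresses: once a row's id_site is set, it drops out of the index.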