I recently started converting a personal project from Microsoft SQL Server to PostgreSQL and I was surprised at the abysmal performance I encountered doing an UPDATE JOIN
between two tables.
Suppose they look something like:
CREATE TABLE foo (
id INTEGER NOT NULL PRIMARY KEY,
bar INTEGER NULL
);
CREATE TABLE foo2 (
id INTEGER NOT NULL PRIMARY KEY,
bar INTEGER NULL
);
In T-SQL I would do an update using a join using something like this:
UPDATE foo
SET bar = t2.bar
FROM foo t1
JOIN foo2 t2
ON t1.id = t2.id;
But running in Postgres, the query is glacially slow.
If I change it to:
UPDATE foo
SET bar = t2.bar
FROM foo2 t2
WHERE foo.id = t2.id;
it's not a problem.
I get that the syntax is different but I would've expected the query optimizer to work out something in the same ball park. Instead, things go bananas. Besides the syntactical differences, is there a nuanced difference between the two queries that I fail to see?
Explain plans
Update on foo (cost=85852.43..6211995294.24 rows=338326628280 width=1027)
-> Nested Loop (cost=85852.43..6211995294.24 rows=338326628280 width=1027)
-> Seq Scan on foo (cost=0.00..145721.10 rows=582410 width=1010)
-> Materialize (cost=85852.43..247935.91 rows=580908 width=17)
-> Hash Join (cost=85852.43..241627.37 rows=580908 width=17)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on foo t1 (cost=0.00..145721.10 rows=582410 width=10)
-> Hash (cost=75754.08..75754.08 rows=580908 width=15)
-> Seq Scan on foo2 t2 (cost=0.00..75754.08 rows=580908 width=15)
Update on foo (cost=87575.47..535974.25 rows=581621 width=1022)
-> Hash Join (cost=87575.47..535974.25 rows=581621 width=1022)
Hash Cond: (foo.id = t2.id)
-> Seq Scan on foo (cost=0.00..151301.17 rows=1140417 width=1011)
-> Hash (cost=75761.21..75761.21 rows=581621 width=36)
-> Seq Scan on foo2 t2 (cost=0.00..75761.21 rows=581621 width=36)
Best Answer
There is no join condition between
foo
andt1
, the implicitCROSS JOIN
forces a Cartesian product, i.e.O(N²)
(!) update operations instead of justO(N)
. And the result is non-deterministic nonsense. The effect also becomes apparent in the query plan:rows=338326628280
instead ofrows=581621
(Also: both plans were produced off slightly different tables, but that seems irrelevant to the question.)Could be fixed by adding a join condition like:
Well, technically, a
WHERE
condition, but all the same.But that's just putting lipstick on a pig. While
id
is the PK column of each table, that's just adding noise. Use the command you already found instead:The manual advises for the
FROM
clause ofUPDATE
:And:
Such a self-join makes sense (or is even necessary!) if you need a
LEFT [OUTER] JOIN
to additional table(s). Sadly, there is no provision in SQL to say"FROM LEFT"
in anUPDATE
. Example: