Sql-server – PostgreSQL update join vs SQL Server update join

join;postgresqlsql serverupdate

I recently started converting a personal project from Microsoft SQL Server to PostgreSQL and I was surprised at the abysmal performance I encountered doing an UPDATE JOIN between two tables.

Suppose they look something like:

CREATE TABLE foo (
  id INTEGER NOT NULL PRIMARY KEY,
  bar INTEGER NULL
);

CREATE TABLE foo2 (
  id INTEGER NOT NULL PRIMARY KEY,
  bar INTEGER NULL
);

In T-SQL I would do an update using a join using something like this:

UPDATE foo
SET bar = t2.bar
FROM foo t1
JOIN foo2 t2
ON t1.id = t2.id;

But running in Postgres, the query is glacially slow.

If I change it to:

UPDATE foo
SET bar = t2.bar
FROM foo2 t2
WHERE foo.id = t2.id;

it's not a problem.

I get that the syntax is different but I would've expected the query optimizer to work out something in the same ball park. Instead, things go bananas. Besides the syntactical differences, is there a nuanced difference between the two queries that I fail to see?

Explain plans

Update on foo  (cost=85852.43..6211995294.24 rows=338326628280 width=1027)
  ->  Nested Loop  (cost=85852.43..6211995294.24 rows=338326628280 width=1027)
        ->  Seq Scan on foo  (cost=0.00..145721.10 rows=582410 width=1010)
        ->  Materialize  (cost=85852.43..247935.91 rows=580908 width=17)
              ->  Hash Join  (cost=85852.43..241627.37 rows=580908 width=17)
                    Hash Cond: (t1.id = t2.id)
                    ->  Seq Scan on foo t1  (cost=0.00..145721.10 rows=582410 width=10)
                    ->  Hash  (cost=75754.08..75754.08 rows=580908 width=15)
                          ->  Seq Scan on foo2 t2  (cost=0.00..75754.08 rows=580908 width=15)

Update on foo (cost=87575.47..535974.25 rows=581621 width=1022)
  ->  Hash Join  (cost=87575.47..535974.25 rows=581621 width=1022)
        Hash Cond: (foo.id = t2.id)
        ->  Seq Scan on foo (cost=0.00..151301.17 rows=1140417 width=1011)
        ->  Hash  (cost=75761.21..75761.21 rows=581621 width=36)
              ->  Seq Scan on foo2 t2  (cost=0.00..75761.21 rows=581621 width=36)

Best Answer

But running in Postgres, the query is glacially slow.

UPDATE foo
SET    bar = t2.bar
FROM   foo t1
JOIN   foo2 t2 ON t1.id = t2.id;

There is no join condition between foo and t1, the implicit CROSS JOIN forces a Cartesian product, i.e. O(N²) (!) update operations instead of just O(N). And the result is non-deterministic nonsense. The effect also becomes apparent in the query plan: rows=338326628280 instead of rows=581621 (Also: both plans were produced off slightly different tables, but that seems irrelevant to the question.)

Could be fixed by adding a join condition like:

UPDATE foo
SET    bar = t2.bar
FROM   foo t1
JOIN   foo2 t2 ON t1.id = t2.id
WHERE  foo.id = t1.id;  -- !

Well, technically, a WHERE condition, but all the same.

But that's just putting lipstick on a pig. While id is the PK column of each table, that's just adding noise. Use the command you already found instead:

UPDATE foo
SET    bar = t2.bar
FROM   foo2 t2
WHERE  foo.id = t2.id;

The manual advises for the FROM clause of UPDATE:

Do not repeat the target table as a from_item unless you intend a self-join (in which case it must appear with an alias in the from_item).

And:

When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the from_item list, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.

Such a self-join makes sense (or is even necessary!) if you need a LEFT [OUTER] JOIN to additional table(s). Sadly, there is no provision in SQL to say "FROM LEFT" in an UPDATE. Example:

Nullify column in update if subquery returns empty

Explain plans

Best Answer

Related Solutions

Postgresql – Running a UPDATE on 3 million records

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

Related Question