PostgreSQL – Updating Columns with the Same Name

Tags: postgresql, table, update

This may be a newbie question, but I don't know how to resolve this. I have a table like:

name | id      | value
-----+---------+--------
A    | 1286487 | 1286333
B    | 1286489 | 1286403
C    | 1286495 | 1286455
C    | 1286496 | 1286375
D    | 1286503 | 1286341
B    | 1286506 | 1286343

And I want to update this table to make it like this:

name | id      | value
-----+---------+--------
A    | 1286487 | 1286333
B    | 1286489 | 1286403
C    | 1286495 | 1286455
C    | 1286495 | 1286375
D    | 1286503 | 1286341
B    | 1286489 | 1286343

So all rows with name B or C get the same id as the first row with that name. Can anyone help me with that?

Best Answer

To fulfil the requirements, we first have to find the id we need - not surprisingly, this can be found using the min() function. Then we do the update - the version that is quickest to write is possibly the following:

UPDATE minupdate m 
   SET id = (SELECT min(id) 
               FROM minupdate
              WHERE name = m.name);

The downside of this is that it can be very slow if you have a bigger number of rows. I have populated the table with some 50,000 random rows, using the following statement:

INSERT INTO minupdate (name, id)
SELECT translate(substr(i::text, 1, 1), '1234567890', 'ABCDEFGHIJ'), 
       random() * 100000
FROM generate_series(1, 50000) t(i);
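
The original post never shows the table's DDL; a minimal definition matching the question's columns could look like this (the exact types are an assumption):

CREATE TABLE minupdate (
    name  text,     -- grouping key (A, B, C, ...)
    id    integer,  -- the column we will overwrite with min(id)
    value integer   -- not filled by the test INSERT above, stays NULL
);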

On my test box, without an index, this took looong minutes - I did not wait for it to produce an actual execution plan (using EXPLAIN (ANALYZE, BUFFERS)), but cancelled it and ran plain EXPLAIN to see what's wrong:

 Update on minupdate m  (cost=0.00..136044281.50 rows=87720 width=12)
   ->  Seq Scan on minupdate m  (cost=0.00..136044281.50 rows=87720 width=12)
         SubPlan 1
           ->  Aggregate  (cost=1550.87..1550.88 rows=1 width=4)
                 ->  Seq Scan on minupdate  (cost=0.00..1526.50 rows=9747 width=4)
                       Filter: (name = m.name)

So I've added an index on (name, id) to get some faster results.
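Judging by the index name that shows up in the plan below (minupdate_name_id_idx), the statement was presumably along these lines:

-- covering index so the subquery can be answered by an index-only scan
CREATE INDEX minupdate_name_id_idx ON minupdate (name, id);

With the index in place, EXPLAIN (ANALYZE, BUFFERS) produces: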

 Update on minupdate m  (cost=0.00..417486.00 rows=50000 width=15) (actual time=1771.008..1771.008 rows=0 loops=1)
   Buffers: shared hit=394845 read=653 dirtied=654
   ->  Seq Scan on minupdate m  (cost=0.00..417486.00 rows=50000 width=15) (actual time=0.062..1317.366 rows=50000 loops=1)
         Buffers: shared hit=172641 read=192
         SubPlan 2
           ->  Result  (cost=8.31..8.32 rows=1 width=0) (actual time=0.018..0.020 rows=1 loops=50000)
                 Buffers: shared hit=171655 read=192
                 InitPlan 1 (returns $1)
                   ->  Limit  (cost=0.29..8.31 rows=1 width=4) (actual time=0.011..0.012 rows=1 loops=50000)
                         Buffers: shared hit=171655 read=192
                         ->  Index Only Scan using minupdate_name_id_idx on minupdate  (cost=0.29..8.31 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=50000)
                               Index Cond: ((name = m.name) AND (id IS NOT NULL))
                               Heap Fetches: 50000
                               Buffers: shared hit=171655 read=192

Now the problem is this: the inner part of the plan is executed 50,000 times - once for every row to be updated. In the index-less version, it would scan the whole table this many times - no wonder it would take very long.

But there is more to it. Those 50,000 loops are ugly - let's find something nicer:

EXPLAIN (ANALYZE, BUFFERS)
UPDATE minupdate m 
   SET id = mins.id 
  FROM (SELECT name, min(id) AS id 
          FROM minupdate 
         GROUP BY name
       ) mins
WHERE m.name = mins.name;

This gives a more streamlined plan:

 Update on minupdate m  (cost=1861.76..4525.59 rows=91698 width=42) (actual time=995.845..995.845 rows=0 loops=1)
   Buffers: shared hit=241809 read=394 dirtied=843
   ->  Hash Join  (cost=1861.76..4525.59 rows=91698 width=42) (actual time=218.866..542.437 rows=50000 loops=1)
         Hash Cond: (m.name = mins.name)
         Buffers: shared hit=972 dirtied=427
         ->  Seq Scan on minupdate m  (cost=0.00..1402.98 rows=91698 width=8) (actual time=0.027..106.868 rows=50000 loops=1)
               Buffers: shared hit=486 dirtied=1
         ->  Hash  (cost=1861.65..1861.65 rows=9 width=36) (actual time=218.816..218.816 rows=9 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               Buffers: shared hit=486 dirtied=426
               ->  Subquery Scan on mins  (cost=1861.47..1861.65 rows=9 width=36) (actual time=218.738..218.790 rows=9 loops=1)
                     Buffers: shared hit=486 dirtied=426
                     ->  HashAggregate  (cost=1861.47..1861.56 rows=9 width=6) (actual time=218.722..218.740 rows=9 loops=1)
                           Group Key: minupdate.name
                           Buffers: shared hit=486 dirtied=426
                           ->  Seq Scan on minupdate  (cost=0.00..1402.98 rows=91698 width=6) (actual time=0.007..111.580 rows=50000 loops=1)
                                 Buffers: shared hit=486 dirtied=426
 Planning time: 0.164 ms
 Execution time: 995.908 ms

There are now two sequential scan nodes there (somewhat surprisingly), but both are executed only once (loops=1).

What I've done here is remove the correlated subquery, replacing it with a derived table (the subquery after FROM) that is joined to the table to be updated. This approach often gives far better plans than the easy way, and on big tables the performance difference can be quite significant.
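
As a quick sanity check (not part of the original answer, just an easy way to confirm the result), one can verify that no name maps to more than one id after the update:

-- should return zero rows once every name shares a single id
SELECT name, count(DISTINCT id) AS distinct_ids
  FROM minupdate
 GROUP BY name
HAVING count(DISTINCT id) > 1;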