PostgreSQL – Updating Columns with the Same Name

Tags: postgresql, table, update

This may be a newbie question, but I don't know how to resolve this. I have a table like:

name | id      | value
-----+---------+--------
A    | 1286487 | 1286333
B    | 1286489 | 1286403
C    | 1286495 | 1286455
C    | 1286496 | 1286375
D    | 1286503 | 1286341
B    | 1286506 | 1286343

And I want to update this table to make it like this:

name | id      | value
-----+---------+--------
A    | 1286487 | 1286333
B    | 1286489 | 1286403
C    | 1286495 | 1286455
C    | 1286495 | 1286375
D    | 1286503 | 1286341
B    | 1286489 | 1286343

So all rows with name B or C get the same id as the first row with that name. Can anyone help me with that?

Best Answer

To fulfil the requirements, we first have to find the id we need - not surprisingly, this can be found using the min() function. Then we do the update - the version that is quickest to write is possibly the following:

UPDATE minupdate m 
   SET id = (SELECT min(id) 
               FROM minupdate
              WHERE name = m.name);

The downside of this is that it can be very slow if you have a bigger number of rows. I have populated the table with some 50,000 random rows, using the following statement:

INSERT INTO minupdate (name, id)
SELECT translate(substr(i::text, 1, 1), '1234567890', 'ABCDEFGHIJ'), 
       random() * 100000
FROM generate_series(1, 50000) t(i);
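
The original post never shows the table's DDL; a minimal definition matching the question's columns could look like this (the exact types are an assumption):

CREATE TABLE minupdate (
    name  text,     -- grouping key (A, B, C, ...)
    id    integer,  -- the column we will overwrite with min(id)
    value integer   -- not filled by the test INSERT above, stays NULL
);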

On my test box, without an index, this took looong minutes - I did not wait for it to produce an actual execution plan (using EXPLAIN (ANALYZE, BUFFERS)), but cancelled it and ran plain EXPLAIN to see what's wrong:

 Update on minupdate m  (cost=0.00..136044281.50 rows=87720 width=12)
   ->  Seq Scan on minupdate m  (cost=0.00..136044281.50 rows=87720 width=12)
         SubPlan 1
           ->  Aggregate  (cost=1550.87..1550.88 rows=1 width=4)
                 ->  Seq Scan on minupdate  (cost=0.00..1526.50 rows=9747 width=4)
                       Filter: (name = m.name)

So I've added an index on (name, id) to get some faster results.
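Judging by the index name that shows up in the plan below (minupdate_name_id_idx), the statement was presumably along these lines:

-- covering index so the subquery can be answered by an index-only scan
CREATE INDEX minupdate_name_id_idx ON minupdate (name, id);

With the index in place, EXPLAIN (ANALYZE, BUFFERS) produces: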

 Update on minupdate m  (cost=0.00..417486.00 rows=50000 width=15) (actual time=1771.008..1771.008 rows=0 loops=1)
   Buffers: shared hit=394845 read=653 dirtied=654
   ->  Seq Scan on minupdate m  (cost=0.00..417486.00 rows=50000 width=15) (actual time=0.062..1317.366 rows=50000 loops=1)
         Buffers: shared hit=172641 read=192
         SubPlan 2
           ->  Result  (cost=8.31..8.32 rows=1 width=0) (actual time=0.018..0.020 rows=1 loops=50000)
                 Buffers: shared hit=171655 read=192
                 InitPlan 1 (returns $1)
                   ->  Limit  (cost=0.29..8.31 rows=1 width=4) (actual time=0.011..0.012 rows=1 loops=50000)
                         Buffers: shared hit=171655 read=192
                         ->  Index Only Scan using minupdate_name_id_idx on minupdate  (cost=0.29..8.31 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=50000)
                               Index Cond: ((name = m.name) AND (id IS NOT NULL))
                               Heap Fetches: 50000
                               Buffers: shared hit=171655 read=192

Now the problem is this: the inner part of the plan is executed 50,000 times - once for every row to be updated. In the index-less version, it would scan the whole table this many times - no wonder it would take very long.

But there is more to it. Those 50,000 loops are ugly - let's find something nicer:

EXPLAIN (ANALYZE, BUFFERS)
UPDATE minupdate m 
   SET id = mins.id 
  FROM (SELECT name, min(id) AS id 
          FROM minupdate 
         GROUP BY name
       ) mins
WHERE m.name = mins.name;

This gives a more streamlined plan:

 Update on minupdate m  (cost=1861.76..4525.59 rows=91698 width=42) (actual time=995.845..995.845 rows=0 loops=1)
   Buffers: shared hit=241809 read=394 dirtied=843
   ->  Hash Join  (cost=1861.76..4525.59 rows=91698 width=42) (actual time=218.866..542.437 rows=50000 loops=1)
         Hash Cond: (m.name = mins.name)
         Buffers: shared hit=972 dirtied=427
         ->  Seq Scan on minupdate m  (cost=0.00..1402.98 rows=91698 width=8) (actual time=0.027..106.868 rows=50000 loops=1)
               Buffers: shared hit=486 dirtied=1
         ->  Hash  (cost=1861.65..1861.65 rows=9 width=36) (actual time=218.816..218.816 rows=9 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               Buffers: shared hit=486 dirtied=426
               ->  Subquery Scan on mins  (cost=1861.47..1861.65 rows=9 width=36) (actual time=218.738..218.790 rows=9 loops=1)
                     Buffers: shared hit=486 dirtied=426
                     ->  HashAggregate  (cost=1861.47..1861.56 rows=9 width=6) (actual time=218.722..218.740 rows=9 loops=1)
                           Group Key: minupdate.name
                           Buffers: shared hit=486 dirtied=426
                           ->  Seq Scan on minupdate  (cost=0.00..1402.98 rows=91698 width=6) (actual time=0.007..111.580 rows=50000 loops=1)
                                 Buffers: shared hit=486 dirtied=426
 Planning time: 0.164 ms
 Execution time: 995.908 ms

There are now two sequential scan nodes there (somewhat surprisingly), but both are executed only once (loops=1).

What I've done here is remove the correlated subquery, replacing it with a derived table (the subquery after FROM) that is joined to the table to be updated. This approach often gives far better plans than the easy way, and on big tables the performance difference can be quite significant.
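
As a quick sanity check (not part of the original answer, just an easy way to confirm the result), one can verify that no name maps to more than one id after the update:

-- should return zero rows once every name shares a single id
SELECT name, count(DISTINCT id) AS distinct_ids
  FROM minupdate
 GROUP BY name
HAVING count(DISTINCT id) > 1;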