Postgresql – Update table with values from another table runs really slow

join;postgresql;update

I have a PostgreSQL table (tab_A) with about 32 million records, and I have a second PostgreSQL table (tab_B) with about 4000 records. tab_B contains 3 fields whose values I intend to pass on to tab_A (field1, field2, field3).

tab_A

32 million records
Unique identifier (id)
BTREE index

tab_B

4000 records
Unique identifier (id)
BTREE index

I am trying to do the job with the following query (before running it, I had already created the 3 new fields in tab_A to hold the values):

UPDATE tab_A
SET field1 = t2.field1, field2 = t2.field2, field3 = t2.field3
FROM tab_A t1 JOIN tab_B t2
ON t1.uprn = t2.uprn;

This query runs for more than 5 hours, and I eventually have to stop it because it doesn't seem to me that it should take that long (my understanding is that if both tables have an index, this should be pretty fast).

Any ideas on whether I am missing something here? Or is it normal for this to take that long, given that tab_A contains 32 million records? Is there another approach that would run more efficiently?

Best Answer

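Don't repeat the target table in the FROM clause. In your query, the tab_A being updated and tab_A t1 are two independent references with no condition linking them, so every row of the target is paired with every row produced by the join. Reference the target table only once and join it to tab_B in the WHERE clause: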

UPDATE tab_A
   SET field1 = t2.field1, field2 = t2.field2, field3 = t2.field3
FROM tab_B t2
WHERE tab_A.uprn = t2.uprn;

Quote from the manual:

"Note that the target table must not appear in the from_list, unless you intend a self-join"
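One way to see the difference without waiting for hours is to compare the planner's estimates for both forms. The following is a minimal sketch that assumes the table and column names from the question. Plain EXPLAIN (without ANALYZE) only plans the statement and does not modify any rows:

```sql
-- Original form: the target tab_A is not linked to t1, so the plan is
-- expected to pair every target row with every row of the t1/t2 join.
EXPLAIN
UPDATE tab_A
   SET field1 = t2.field1, field2 = t2.field2, field3 = t2.field3
FROM tab_A t1 JOIN tab_B t2
  ON t1.uprn = t2.uprn;

-- Corrected form: the target is joined directly to tab_B on uprn.
EXPLAIN
UPDATE tab_A
   SET field1 = t2.field1, field2 = t2.field2, field3 = t2.field3
FROM tab_B t2
WHERE tab_A.uprn = t2.uprn;
```

Comparing the estimated row counts of the two plans should make the unintended cross join visible. The exact plans depend on your data, statistics and indexes.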

Related Solutions

Postgresql – Update column with data from another table

The proper form would be (assuming current pg version 9.3 for lack of information):

UPDATE C 
SET    column1 = A.column1 
     , column2 = B.column2 
     , column3 = A.column1 + B.column2
FROM   A
JOIN   B ON A.id = B.id  -- ??? not specified in question!
WHERE  C.id = A.id      --  ??? not specified in question!
AND   (C.column1, C.column2, C.column3) IS DISTINCT FROM
      (A.column1, B.column2, A.column1 + B.column2);

The last condition in the WHERE clause (the IS DISTINCT FROM comparison) is optional. It avoids empty updates that would not change anything (but would still write a new row version at full cost).
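The reason to use IS DISTINCT FROM rather than <> for this check is NULL handling. A small illustration with made-up literal values (not taken from the question), comparing a row that still holds a NULL with the row that should replace it:

```sql
-- Row-wise <> yields NULL when the only differing pair involves a NULL;
-- IS DISTINCT FROM treats NULL as a comparable value and never yields NULL.
SELECT (NULL::int, 2) <> (5, 2)               AS with_not_equal,  -- NULL
       (NULL::int, 2) IS DISTINCT FROM (5, 2) AS with_distinct;   -- true
```

In a WHERE clause a NULL result is not treated as true, so a <> comparison would skip rows where a NULL has to be replaced by a value, while IS DISTINCT FROM still selects them.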

ypercube already gave a basic explanation in his comment:

You don't get duplication. Your derived table is cross joining A and B (i.e. without any joining condition) and then choosing an arbitrary row (LIMIT 1 without ORDER BY). It then uses the values from that arbitrary row to update all rows of table C. If you want different values to be used for different rows of C, you'll have to join the 3 tables (using JOIN - ON and WHERE)

Refer to the manual on UPDATE for details.

Sql-server – Update field in SQL Server Table Dynamically

If your table is test, try this:

SELECT t1.*
    ,New_FileName = t1.FileName + CASE 
        WHEN t2.seq > 1
            THEN '_' + ltrim(str(seq))
        ELSE ''
        END
FROM test t1
JOIN (
    SELECT ID
        ,Memeber_Number
        ,FileName
        ,seq = ROW_NUMBER() OVER (
            PARTITION BY FileName
            ,Memeber_Number ORDER BY ID
            )
    FROM test
    ) t2 ON t1.id = t2.id;
