PostgreSQL – Running an UPDATE on 3 million records

postgresql, postgresql-9.2

I have two actors in this transaction. One actor is the table "update_assets", a set of records with data that is up-to-date and new. The second actor is the table "application_assets", which is used by an application and needs to be updated. The problem is that the UPDATE is taking far too long; it has now been running for 4 hours. The transaction is running on a VM with 3 AMD cores and 8 GB of RAM allocated to it. The application is not running in the background.

Tables:

update_assets
~3.3 million records
18 columns
1 index on amsid

application_assets
~7 million records
5 columns
1 index on id

Query:

UPDATE application_assets as ams SET number = amu.num, letter = amu.letter 
FROM update_assets amu
WHERE ams.id = amu.amsid
AND ams.company_id = '4';

Explain output:

Update on application_assets ams  (cost=219382.14..747965.42 rows=3303562 width=85)
  ->  Hash Join  (cost=219382.14..747965.42 rows=3303562 width=85)
        Hash Cond: ((ams.id)::text = (amu.amsid)::text)
        ->  Seq Scan on application_assets ams  (cost=0.00..244642.25 rows=7883425 width=63)
        ->  Hash  (cost=145825.62..145825.62 rows=3303562 width=55)
              ->  Seq Scan on update_assets amu  (cost=0.00..145825.62 rows=3303562 width=55)

If I removed all of the columns from the update_assets table except for the ones pertinent to this transaction, would that speed up the UPDATE? I noticed that the Hash Cond portion of the plan applies a ::text cast to each of the join columns. If the columns were already of the text data type, would the data be updated more quickly? Does PostgreSQL always convert data to text for these kinds of joins? What else can I do to expedite this process?

Best Answer

First, 4 hours seems like a long time for this. I have done big updates and would not expect one to take this long where an inner join is involved (antijoins are more troublesome).

To your questions:

  1. If I removed all of the columns from the update_assets table, except for the pertinent columns for this transaction, would that speed up the transaction? Maybe, if you rewrote the table afterwards (using CLUSTER, or SELECT INTO); see the sketch after this list.

  2. Does PostgreSQL always change data to text for these types of joins? No, so converting everything to text will not help.
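If you do decide to rewrite the table, a minimal sketch might look like the following. The table name update_assets_slim is made up, and the column list assumes amsid, num, and letter are all the UPDATE needs (the names used in your query); CREATE TABLE ... AS is equivalent to the SELECT INTO mentioned above:

-- Rewrite update_assets as a compact table holding only the join key
-- and the two payload columns, then index and analyze it.
CREATE TABLE update_assets_slim AS
SELECT amsid, num, letter
FROM update_assets;

CREATE INDEX ON update_assets_slim (amsid);
ANALYZE update_assets_slim;

The rewritten table is narrower and has no dead rows, so the seq scan and the hash build on that side of the join read considerably less data.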

So what you can do is look at what is actually happening. Initial troubleshooting steps:

  1. Run top and watch the query. Is it I/O-bound or CPU-bound? What is memory usage doing?

  2. Does your db fit comfortably in RAM? If not, consider allocating more RAM to this VM. (See the size checks after this list.)

  3. Check the data types of the join columns and make sure they are the same. Bad things can happen when the server has to cast to text for a join like this. (See the catalog query after this list.)

  4. Check for locks. Maybe the process is proceeding slowly because it is periodically waiting on locks held by concurrent transactions? (See the lock queries after this list.)
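For step 2, a rough way to gauge whether things fit in the 8 GB on the VM (table names taken from the question):

SELECT pg_size_pretty(pg_database_size(current_database()));
SELECT pg_size_pretty(pg_total_relation_size('application_assets'));
SELECT pg_size_pretty(pg_total_relation_size('update_assets'));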
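For step 3, the declared types of the two join columns can be compared from information_schema (table and column names as given in the question):

SELECT table_name, column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE (table_name = 'application_assets' AND column_name = 'id')
   OR (table_name = 'update_assets' AND column_name = 'amsid');

If the types or lengths differ, the planner has to cast one side, which is what the ::text in your Hash Cond hints at.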
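For step 4, on 9.2 you can check the boolean waiting flag in pg_stat_activity (newer releases replaced it with wait_event columns) and look for ungranted lock requests in pg_locks:

-- Sessions currently waiting on a lock:
SELECT pid, waiting, state, query
FROM pg_stat_activity
WHERE waiting;

-- Ungranted lock requests and the query that is stuck waiting:
SELECT l.pid, l.locktype, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;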