PostgreSQL – Performance degradation while updating tables having tens of millions of records

performance, postgresql, query-performance, update

I want to update tables (maybe 20-30) having tens of millions of records each.

The problem is that the update process takes too much time, and CPU usage goes very high while it runs. I want to run the update in such a way that it does not use much CPU while processing the data. A longer processing time is not a problem for me, but the update should use limited CPU resources. I am using PostgreSQL as the database, and the server operating system is Linux.

A sample query of mine looks like this:

UPDATE temp
SET customername =
  (SELECT customername FROM "user"   -- "user" is reserved in PostgreSQL, so quote it
   WHERE "user".customerid = temp.customerid);

Best Answer

The first question is: Why is it important that you not use a lot of CPU time? The query will be bottlenecked on some resource; if you could introduce enough additional disk access, the CPU time used per second would go down, but would that really be an improvement? What resource would you prefer to saturate? Understanding why you have emphasized this might help guide people to providing an answer that you will find useful.

As suggested in a comment, your query may run faster with a join rather than a correlated subquery. (Note that the two are not quite equivalent: the correlated subquery sets customername to NULL for rows with no matching user, while the join form leaves those rows untouched.) Something like this:

UPDATE temp
  SET customername = u.customername
  FROM "user" u   -- "user" is a reserved word in PostgreSQL, so quote it
  WHERE u.customerid = temp.customerid;

Another important thing to know is whether you want to update all rows in the table. Are some of the values already correct? If so, you will get a big performance boost by skipping the rows that don't need updating: add AND temp.customername IS DISTINCT FROM user.customername to the WHERE clause. (Unlike <>, IS DISTINCT FROM treats two NULLs as equal, so rows that are NULL on both sides are also skipped.)
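Putting the join and that filter together, the statement would look like this (quoting "user", which is a reserved word in PostgreSQL):

UPDATE temp
  SET customername = u.customername
  FROM "user" u
  WHERE u.customerid = temp.customerid
    AND temp.customername IS DISTINCT FROM u.customername;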

If you limit the number of rows updated in each statement, and VACUUM ANALYZE after each UPDATE, you will avoid table bloat. If the point of minimizing CPU time is to avoid a performance impact on concurrent transactions, this also gives you the opportunity to introduce a short delay (a sleep or something similar) before you start the next UPDATE of a set of rows.
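As a sketch of that batched approach (assuming temp has a primary key column id, which is hypothetical here), each round updates at most 10,000 rows that still need the change; repeat, with a pause between rounds, until an UPDATE reports 0 rows affected:

-- one batch: only rows whose customername still differs
UPDATE temp
  SET customername = u.customername
  FROM "user" u
  WHERE u.customerid = temp.customerid
    AND temp.id IN (
      SELECT t.id
      FROM temp t
      JOIN "user" u2 ON u2.customerid = t.customerid
      WHERE t.customername IS DISTINCT FROM u2.customername
      LIMIT 10000);

VACUUM ANALYZE temp;

-- then pause (e.g. SELECT pg_sleep(5); or a sleep in the driving script) and repeat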

Even better, why are you redundantly storing the information in the temp table rather than joining to it when needed? (Sometimes there is a good reason; quite often there isn't.)
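For example (a sketch; the view name and column list are hypothetical), instead of storing customername in temp at all, you could expose it through a view and never need the bulk UPDATE:

CREATE VIEW temp_with_customer AS
SELECT t.customerid,          -- plus whatever other temp columns you need
       u.customername
FROM temp t
LEFT JOIN "user" u ON u.customerid = t.customerid;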