PostgreSQL update column in one table with value from another, based on key from first table

postgresqlpostgresql-11postgresql-performanceupdate

Sorry about the title, I wasn't sure how best to word this. My situation is this:

I have two tables:

paper_author [paper_id, author_id, author_name]
authors [author_id, author_name]

Right now, the author_name column in paper_author is NULL. I would like to select the author name from the authors table and insert it into the author_name column in the paper_author table. I've written the following query:

UPDATE paper_author pa SET author_name = (SELECT author_name FROM authors a WHERE a.author_id = pa.author_id);

Which I think will do what I want, but this is taking a very long time to run (days). For reference, the table paper_author has ~900M rows and the table authors has ~200M rows. The author_id is the primary key in the authors table. author_id has an index in the table paper_author. The rows in paper_author are not unique on author_id (i.e. each author_id may appear multiple times associated with different papers).

Is there a more efficient way to write this? Have I missed something?

Best Answer

An update of that magnitude is inevitably going to take a substantial amount of time, no matter how good your configuration is or how powerful the hardware. It's possible there might be room for improvement, but there are not enough details to go on here. For a start, it would be interesting to know the output of:

EXPLAIN (VERBOSE, SETTINGS)
   UPDATE paper_author pa SET author_name = 
   (SELECT author_name FROM authors a WHERE a.author_id = pa.author_id)

You might want to consider an alternative strategy along the lines of breaking up the update into smaller batches. That will give you a better idea of how long the process might last, and provide better control over your system resources.