Postgresql – Select data from table join or UPDATE

performance postgresql update

I have a question regarding PostgreSQL database performance. I am given a CSV file with a large number of rows (around 1 million) that represents updates to a database table (let's call it MY_TABLE). I would like to copy the data from the CSV file into MY_TABLE. It is presumed that all the rows in the CSV are updates to existing data in MY_TABLE, not new inserts. I have two options for doing this, and I would like to know which one I should expect to be more performant and why (if possible):

Option 1:

1) Create a secondary table, let's say TEMP_TABLE.
2) Upload all the data from the CSV into TEMP_TABLE.
3) Natural join (on PK) MY_TABLE x TEMP_TABLE.
4) Write the joined rows back into MY_TABLE wherever the PK from the join matches the PK in MY_TABLE (effectively a single UPDATE ... FROM, as sketched below).
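
A minimal sketch of Option 1, assuming MY_TABLE has a primary key pk plus two payload columns col1 and col2, and that the CSV sits at /path/to/updates.csv on the database server (all of these names and the path are placeholders, not from the question):

-- 1) Secondary table with the same structure (TEMP so it disappears after the session)
CREATE TEMP TABLE TEMP_TABLE (LIKE MY_TABLE INCLUDING DEFAULTS);

-- 2) Bulk-load the CSV (use \copy from psql instead if the file lives on the client)
COPY TEMP_TABLE FROM '/path/to/updates.csv' WITH (FORMAT csv);

-- 3) + 4) Join on the PK and write the new values back in one statement
UPDATE MY_TABLE AS m
SET    col1 = t.col1,
       col2 = t.col2
FROM   TEMP_TABLE t
WHERE  m.pk = t.pk;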

Option 2:

1) Create a SQL script on the client side containing N UPDATE statements of the form:

UPDATE MY_TABLE SET col1 = <value-from-csv>, col2 = <value-from-csv>, ... WHERE MY_TABLE.PK = <PK-from-csv>;

where N is the number of rows in the CSV.

2) Upload that script to the database and execute it.
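
The generated script might look roughly like this (hypothetical column names and values; wrapping all N statements in one transaction avoids paying a commit per row):

BEGIN;

UPDATE MY_TABLE SET col1 = 'a', col2 = 10 WHERE PK = 1;
UPDATE MY_TABLE SET col1 = 'b', col2 = 20 WHERE PK = 2;
-- ... one UPDATE per CSV row, N statements in total ...

COMMIT;

The file could then be executed with something like psql -f updates.sql mydb (file and database names are placeholders).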

Which of these options is more performant and why? Thanks.

Best Answer

I would think that Option 2 is preferable, for the following reasons:

  1. Fewer writes to the database overall (no extra table required)
  2. Less WAL activity overall (Option 1 performs both an INSERT/COPY and an UPDATE for every row involved)

In either case you may end up with some table bloat, so make sure you VACUUM after doing this operation.
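
For example (autovacuum would eventually get to the table on its own, but running this explicitly right after the bulk update marks the dead row versions reusable and refreshes the planner statistics sooner; MY_TABLE is the table name from the question):

VACUUM ANALYZE MY_TABLE;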