I have a question regarding PostgreSQL DB performance. I am given a CSV file with a lot of rows (let's say around 1 million rows) that represents updates to a database table (let's call it MY_TABLE). I would like to copy the data into MY_TABLE from the CSV file. It is presumed that all the rows in the CSV are updates to the data in MY_TABLE and are not new inserts. I have 2 options for doing this, and I would like to know which one I should expect to be more performant and why (if possible):
Option 1:
1) Create a secondary table, let's say TEMP_TABLE.
2) Upload all the data from the CSV into TEMP_TABLE.
3) Join MY_TABLE with TEMP_TABLE on the primary key.
4) Update the rows in MY_TABLE with the corresponding values from TEMP_TABLE wherever the primary keys match (see the sketch after this list).
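A minimal sketch of what Option 1 might look like in SQL is below; the column names col1/col2 and the CSV path are placeholders, and the exact COPY options depend on how the CSV is formatted:

-- 1) Staging table with the same structure as the target
CREATE TEMP TABLE TEMP_TABLE (LIKE MY_TABLE INCLUDING ALL);
-- 2) Bulk-load the CSV (use \copy from psql if the file lives on the client)
COPY TEMP_TABLE FROM '/path/to/updates.csv' WITH (FORMAT csv, HEADER true);
-- 3) + 4) Set-based update of the matching rows
UPDATE MY_TABLE m
SET col1 = t.col1,
    col2 = t.col2
FROM TEMP_TABLE t
WHERE m.PK = t.PK;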
Option 2:
1) Create a SQL script on the client side containing N UPDATE statements of the form:
UPDATE MY_TABLE SET column1 = value1-from-csv, column2 = value2-from-csv, ... WHERE MY_TABLE.PK = PK-from-csv;
where N is the number of rows in the CSV.
2) Send that script to the database and execute it (a small sketch of such a script follows).
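For illustration, the generated script would contain one statement per CSV row, along these lines (the column names and values here are made up):

UPDATE MY_TABLE SET col1 = 'abc', col2 = 42 WHERE PK = 1;
UPDATE MY_TABLE SET col1 = 'def', col2 = 99 WHERE PK = 2;
-- ... and so on, N statements in total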
Which of these options is more performant and why? Thanks.
Best Answer
I would think that Option 2 is preferable for the following reason: Option 1 performs both an INSERT/COPY and an UPDATE for each row involved (every row is first written into TEMP_TABLE and then written again when MY_TABLE is updated), whereas Option 2 writes each row only once.
In either case, you could encounter some table bloat, in which case you'd want to make sure you VACUUM after doing this operation.
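For example, assuming MY_TABLE is the table that was updated, something like:

-- reclaim dead row versions left by the update and refresh planner statistics
VACUUM (ANALYZE) MY_TABLE;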