I have a question regarding PostgreSQL DB performance. I am given a CSV file with a lot of rows (let's say around 1 million rows) that represents updates to a database table (let's call it MY_TABLE). I would like to copy the data into MY_TABLE from the CSV file. It is presumed that all the rows in the CSV are updates to the data in MY_TABLE and are not new inserts. I have 2 options for doing this, and I would like to know which one I should expect to be more performant and why (if possible):
Option 1:
1) Create a secondary table, let's say TEMP_TABLE.
2) Upload all the data from the CSV into TEMP_TABLE.
3) Join MY_TABLE with TEMP_TABLE on the primary key.
4) Update the rows in MY_TABLE with the corresponding values from TEMP_TABLE wherever the primary keys match (see the sketch after this list).
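A minimal sketch of what Option 1 might look like in SQL is below; the column names col1/col2 and the CSV path are placeholders, and the exact COPY options depend on how the CSV is formatted:

-- 1) Staging table with the same structure as the target
CREATE TEMP TABLE TEMP_TABLE (LIKE MY_TABLE INCLUDING ALL);
-- 2) Bulk-load the CSV (use \copy from psql if the file lives on the client)
COPY TEMP_TABLE FROM '/path/to/updates.csv' WITH (FORMAT csv, HEADER true);
-- 3) + 4) Set-based update of the matching rows
UPDATE MY_TABLE m
SET col1 = t.col1,
    col2 = t.col2
FROM TEMP_TABLE t
WHERE m.PK = t.PK;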
Option 2:
1) Create a SQL script on the client side containing N UPDATE statements of the form:
UPDATE MY_TABLE SET column1 = value1-from-csv, column2 = value2-from-csv, ... WHERE MY_TABLE.PK = PK-from-csv;
where N is the number of rows in the CSV.
2) Send that script to the database and execute it (a small sketch of such a script follows).
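For illustration, the generated script would contain one statement per CSV row, along these lines (the column names and values here are made up):

UPDATE MY_TABLE SET col1 = 'abc', col2 = 42 WHERE PK = 1;
UPDATE MY_TABLE SET col1 = 'def', col2 = 99 WHERE PK = 2;
-- ... and so on, N statements in total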
Which of these options is more performant and why? Thanks.
Best Answer
I would think that Option 2 is preferable for the following reason: Option 1 performs both an INSERT/COPY and an UPDATE for each row involved (every row is first written into TEMP_TABLE and then written again when MY_TABLE is updated), whereas Option 2 writes each row only once.
In either case, you could encounter some table bloat, in which case you'd want to make sure you VACUUM after doing this operation.
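For example, assuming MY_TABLE is the table that was updated, something like:

-- reclaim dead row versions left by the update and refresh planner statistics
VACUUM (ANALYZE) MY_TABLE;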