Postgresql – Most efficient way of bulk deleting rows from postgres

bulkdeletepostgresql

I'm wondering what the most efficient way would be to delete large numbers of rows from PostgreSQL, this process would be part of a recurring task every day to bulk import data (a delta of insertions + deletions) into a table. There could be thousands, potentially millions of rows to delete.

I have a file of primary keys, one per line. The two options I was thinking of were along the lines of the below, but I don't know/understand enough of the internals of PostgreSQL to make an informed decision which would be best.

Execute a DELETE query for each row in the file, with a simple WHERE on primary key (or group the deletes in batches of n using an IN() clause)
Import the primary keys into a temporary table using the COPY command and then deleting from the main table using a join

Any suggestions will be much appreciated!

Best Answer

Your second option is far cleaner and will perform well enough to make that worth it. Your alternative is to build gigantic queries which will be quite a pain to plan and execute. In general you are going to be better off letting PostgreSQL do the work here. In general, I have found updates on tens of thousands of rows in the manner you are describing to perform adequately, but there is one important thing to avoid doing.

The way to do it is to use a select and a join in your delete.

DELETE FROM foo WHERE id IN (select id from rows_to_delete);

Under no circumstances should you as follows with a large table:

DELETE FROM foo WHERE id NOT IN (select id from rows_to_keep);

This will usually cause a nested loop antijoin which will make performance rather problematic. If you end up having to go that route, do this instead:

DELETE FROM foo 
WHERE id IN (select id from foo f 
          LEFT JOIN rows_to_keep d on f.id = d.id
              WHERE d.id IS NULL);

PostgreSQL is usually pretty good at avoiding bad plans but there are still cases involving outer joins which can make a big difference between good and bad plans.

This is wandering a little more far afield, but I figure it is worth mentioning because of how easy it is to go from the IN to NOT IN and watch query performance tank.

Best Answer

Related Solutions

Sql-server – BULK INSERT – best usage

PostgreSQL: How does the implicit index work when using “COPY” for loading bulk rows into a newly created table

Related Question