I'm wondering what the most efficient way would be to delete large numbers of rows from PostgreSQL. This would be part of a recurring daily task that bulk imports data (a delta of insertions + deletions) into a table. There could be thousands, potentially millions of rows to delete.
I have a file of primary keys, one per line. The two options I was thinking of were along the lines of the below, but I don't know/understand enough of the internals of PostgreSQL to make an informed decision which would be best.
- Execute a `DELETE` query for each row in the file, with a simple `WHERE` on the primary key (or group the deletes in batches of *n* using an `IN()` clause)
- Import the primary keys into a temporary table using the `COPY` command, then delete from the main table using a join
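For concreteness, the staging step of the second option might be sketched like this (the table and file names are hypothetical; `\copy` is psql's client-side variant of `COPY`):

```sql
-- Stage the keys: one primary key per line in the input file
CREATE TEMP TABLE keys_to_delete (id bigint PRIMARY KEY);

-- psql's client-side \copy reads the file from the client machine
\copy keys_to_delete FROM 'keys.txt'

-- ANALYZE so the planner has row estimates for the subsequent join
ANALYZE keys_to_delete;
```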
Any suggestions will be much appreciated!
Best Answer
Your second option is far cleaner and will perform well enough to make it worth doing. The alternative is to build gigantic queries which will be quite a pain to plan and execute. In general you are going to be better off letting PostgreSQL do the work here. I have found updates on tens of thousands of rows in the manner you are describing to perform adequately, but there is one important thing to avoid.
The way to do it is to use a select and a join in your delete.
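The original code sample appears to have been lost from this post; a minimal sketch of that shape, with hypothetical table and column names, would be:

```sql
-- Delete rows from the main table whose keys appear in the staged list
-- ("main_table" and "keys_to_delete" are hypothetical names)
DELETE FROM main_table m
USING keys_to_delete k
WHERE m.id = k.id;
```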
Under no circumstances should you do the following with a large table:
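The example seems to have been dropped from this post; judging from the surrounding text, the pattern being warned against is presumably a `NOT IN` over a large subquery, something like (hypothetical names):

```sql
-- Anti-pattern on large tables: NOT IN over a big subquery.
-- If the subquery column is nullable, PostgreSQL cannot convert
-- this into an efficient anti-join.
DELETE FROM main_table
WHERE id NOT IN (SELECT id FROM keys_to_keep);
```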
This will usually cause a nested loop antijoin which will make performance rather problematic. If you end up having to go that route, do this instead:
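The code block for the fix also appears to be missing; the usual rewrite is `NOT EXISTS`, which PostgreSQL's planner can execute as a hashed anti-join rather than a nested loop (hypothetical names again):

```sql
-- NOT EXISTS expresses the same intent but lets the planner
-- choose a hashed anti-join
DELETE FROM main_table m
WHERE NOT EXISTS (
    SELECT 1
    FROM keys_to_keep k
    WHERE k.id = m.id
);
```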
PostgreSQL is usually pretty good at avoiding bad plans but there are still cases involving outer joins which can make a big difference between good and bad plans.
This is wandering a little far afield, but I figure it is worth mentioning because of how easy it is to go from `IN` to `NOT IN` and watch query performance tank.