PostgreSQL – When to Break Large Delete Queries

postgresql

I've got an auto-generated join table (three columns, two of them foreign keys to other tables) that was recently corrupted during an aborted migration. As a result, there are around 1 million duplicate rows that need to be removed. I've been reading up on the best way to do this and have run into conflicting advice. I've managed to get a list of the duplicates that need to be deleted using this query:

SELECT MIN(id)
FROM my_join_table
WHERE site_id=42    -- The migrations targeted a single site.
GROUP BY content_id
HAVING COUNT(*)>1;
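
(For scale: wrapping the query above in a count is a quick, read-only way to confirm how many ids are involved.)

-- Read-only sanity check: counts the candidate ids, modifies nothing.
SELECT COUNT(*)
FROM (
    SELECT MIN(id)
    FROM my_join_table
    WHERE site_id=42
    GROUP BY content_id
    HAVING COUNT(*)>1
) AS duplicate_ids;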

However, when I tried to run a simple DELETE with the above as an inner query, the command timed out after several hours.

DELETE FROM my_join_table
WHERE id IN ( <insert above query here> );

So my question has two parts:

  1. Would breaking that large set (~1 million ids) up into smaller delete queries be more efficient?
  2. If so, what's a good candidate for the batch size? From what I can see online, no one suggests going below 100 elements per batch, but I'm hoping I can get away with 1,000 or even 10,000. (A sketch of what I mean by breaking it up follows this list.)
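
By breaking it up, I mean something along these lines (the ids_to_delete temp table name and the 10,000 batch size are just placeholders I picked, not settled choices):

-- Materialize the candidate ids once.
CREATE TEMP TABLE ids_to_delete AS
SELECT MIN(id) AS id
FROM my_join_table
WHERE site_id=42
GROUP BY content_id
HAVING COUNT(*)>1;

-- Repeat this statement until it reports DELETE 0; each run removes
-- one batch from the join table and from the worklist.
WITH batch AS (
    SELECT id FROM ids_to_delete LIMIT 10000
),
removed AS (
    DELETE FROM my_join_table m
    USING batch b
    WHERE m.id = b.id
)
DELETE FROM ids_to_delete t
USING batch b
WHERE t.id = b.id;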

Best Answer

From experience I can say that one large delete is usually better. There are some corner cases with IN that might invalidate this, but it generally holds true. Make sure you've got enough work_mem available so that PostgreSQL can hash the IN list nicely.
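
To illustrate the suggestion: bump work_mem for the session before re-running the single big delete (the 256MB figure is just an assumption; size it to the memory you actually have free):

-- Session-level setting; pick a value that fits the machine.
SET work_mem = '256MB';

DELETE FROM my_join_table
WHERE id IN (
    SELECT MIN(id)
    FROM my_join_table
    WHERE site_id=42
    GROUP BY content_id
    HAVING COUNT(*)>1
);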