PostgreSQL – When to Break Large Delete Queries

postgresql

I've got an auto-generated join table (three columns, two of them foreign keys to other tables) that was recently corrupted during an aborted migration. As a result, there are around 1 million duplicate rows that need to be removed. I've been reading up on the best way to do this and have run into conflicting advice. I've managed to get a list of the duplicates that need to be deleted using this query:

SELECT MIN(id)
FROM my_join_table
WHERE site_id=42    -- The migrations targeted a single site.
GROUP BY content_id
HAVING COUNT(*)>1;
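
(For scale: wrapping the query above in a count is a quick, read-only way to confirm how many ids are involved.)

-- Read-only sanity check: counts the candidate ids, modifies nothing.
SELECT COUNT(*)
FROM (
    SELECT MIN(id)
    FROM my_join_table
    WHERE site_id=42
    GROUP BY content_id
    HAVING COUNT(*)>1
) AS duplicate_ids;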

However, when I tried to run a simple DELETE with the above as an inner query, the command timed out after several hours.

DELETE FROM my_join_table
WHERE id IN ( <insert above query here> );

So my question has two parts:

  1. Would breaking that large set (~1 million ids) up into smaller delete queries be more efficient?
  2. If so, what's a good candidate for the batch size? From what I can see online, no one suggests going below 100 elements per batch, but I'm hoping I can get away with 1,000 or even 10,000. (A sketch of what I mean by breaking it up follows this list.)
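
By breaking it up, I mean something along these lines (the ids_to_delete temp table name and the 10,000 batch size are just placeholders I picked, not settled choices):

-- Materialize the candidate ids once.
CREATE TEMP TABLE ids_to_delete AS
SELECT MIN(id) AS id
FROM my_join_table
WHERE site_id=42
GROUP BY content_id
HAVING COUNT(*)>1;

-- Repeat this statement until it reports DELETE 0; each run removes
-- one batch from the join table and from the worklist.
WITH batch AS (
    SELECT id FROM ids_to_delete LIMIT 10000
),
removed AS (
    DELETE FROM my_join_table m
    USING batch b
    WHERE m.id = b.id
)
DELETE FROM ids_to_delete t
USING batch b
WHERE t.id = b.id;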

Best Answer

From experience I can say that one large delete is usually better. There are some corner cases with IN that might invalidate this, but it generally holds true. Make sure you've got enough work_mem available so that PostgreSQL can hash the IN list nicely.
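
To illustrate the suggestion: bump work_mem for the session before re-running the single big delete (the 256MB figure is just an assumption; size it to the memory you actually have free):

-- Session-level setting; pick a value that fits the machine.
SET work_mem = '256MB';

DELETE FROM my_join_table
WHERE id IN (
    SELECT MIN(id)
    FROM my_join_table
    WHERE site_id=42
    GROUP BY content_id
    HAVING COUNT(*)>1
);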