Postgresql – Postgres fast update on non-indexed column

Tags: delete, performance, postgresql, postgresql-performance

Step 1

We have a delete query of the form that we are trying to speed up:

DELETE FROM table_name 
WHERE col_name IN ('a','b',....'zzzz');

The operation deletes between 0.5% and 50% of the table's rows. col_name is an indexed (non-unique) column.

This ran extremely slowly because each delete affected the index.
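
A quick way to confirm where the time goes is to look at the actual plan. A minimal sketch, assuming it is run inside a transaction you can roll back (EXPLAIN ANALYZE really executes the DELETE; the key list here is a shortened placeholder):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM table_name
WHERE col_name IN ('a','b','zzzz');  -- shortened placeholder list
ROLLBACK;  -- undo the delete performed by EXPLAIN ANALYZE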

Step 2

We added a non-indexed boolean tombstone column called deleted, with DEFAULT FALSE. Our query now became:

UPDATE table_name 
SET deleted = TRUE 
WHERE col_name IN ('a','b',....'zzzz');

This definitely runs quicker (by 60-200%), but seems to ignore the col_name index for large IN clauses. However, since the update only touches an unindexed column, it remains fast.
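
For completeness, the schema change implied here would look something like the sketch below (the column name and default are taken from the question; note that on PostgreSQL versions before 11, adding a column with a default rewrites the whole table):

-- Tombstone column; deliberately left unindexed so updates to it
-- do not have to touch any index
ALTER TABLE table_name
    ADD COLUMN deleted boolean DEFAULT FALSE;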

Step 3

We rewrote the condition as:

UPDATE table_name 
SET deleted = TRUE 
WHERE col_name = 'a' 
    OR col_name = 'b' 
    OR ... 
    OR col_name = 'zzzz';

Even though this utilizes the index, it runs at about the same speed as the DELETE from Step 1.

Is there a fast way to delete (or mark as deleted) a number of rows based on membership within a very large IN clause?

The database needs no concurrency handling as it is accessed by a dedicated single-threaded application.

Note: performing the deletes/updates individually was an order of magnitude slower. The IN clause generally has between 20,000 and 5 million elements.
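
As a point of reference only: with key lists in the millions, the same membership test is often expressed by staging the keys in a temporary table rather than an IN list. A sketch using the question's table and column names (the file path and the text type of col_name are assumptions; this changes only the formulation, not the index-versus-sequential-scan decision discussed in the answer below):

-- Stage the keys; \copy from the client is an alternative to server-side COPY
CREATE TEMP TABLE keys_to_delete (col_name text PRIMARY KEY);
COPY keys_to_delete FROM '/path/to/keys.csv';  -- hypothetical path

-- Delete (or UPDATE ... SET deleted = TRUE) by joining against the staged keys
DELETE FROM table_name t
USING keys_to_delete k
WHERE t.col_name = k.col_name;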

Best Answer

I'm afraid there isn't any significant improvement available for your scenario.

A few things to consider:

  1. If your data statistics are good enough (i.e. autovacuum has had the opportunity to properly ANALYZE your data), it is likely that PostgreSQL will use an Index Scan when you DELETE at the low end of the 0.5-50% range, and a Sequential Scan at the high end. It is a matter of estimated cost, and the threshold where it switches from one method to the other depends on cost parameters you can actually tune (see the first sketch after this list).
    It isn't likely that there is any faster way: in any case, the database needs to locate the rows to delete (using an index or not) and delete them. For PostgreSQL, deleting a row amounts to changing the value of a hidden system column (xmax) and, later on, reclaiming the unused space through (auto)VACUUM. Other databases (especially non-transactional ones) perform this operation in significantly different ways; normally you trade higher speed for much lower safety.

  2. The way PostgreSQL's Multi-Version Concurrency Control (MVCC) works, an UPDATE is basically equivalent to a DELETE followed by an INSERT. If you don't touch any indexed column, the UPDATE can take place as a Heap-Only Tuple (HOT) update (you can check how often that happens with the statistics query after this list). This won't help much when you update a huge percentage of your table, because it is unlikely that there is enough free room in your pages. In practice: don't expect an UPDATE to be faster than a DELETE. Your soft deletes aren't likely to be faster than hard deletes (as you have already been able to verify).

  3. If your data is not critical to you (i.e. you use the database to process information that is also stored somewhere else, so losing it would not be critical because you could retrieve it again from its sources), you can use unlogged tables. You could also set aggressive values for some database parameters: synchronous_commit (= off) and fsync (= off); a sketch of these settings follows the list. I would discourage you from doing so unless your need for speed is critical and the data can indeed be easily recreated.
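
Regarding point 1, a sketch of how you can watch the planner switch between an index scan and a sequential scan by adjusting one of those cost parameters (random_page_cost is just an example knob, and 1.1 an example value for fast storage; the key list is a shortened placeholder):

EXPLAIN
DELETE FROM table_name
WHERE col_name IN ('a','b','zzzz');

-- Make random I/O look cheaper; session-local, so easy to experiment with
SET random_page_cost = 1.1;

EXPLAIN
DELETE FROM table_name
WHERE col_name IN ('a','b','zzzz');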
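
Regarding point 2, the statistics views record how many updates took the HOT path, so you can check whether the soft-delete UPDATE avoided the indexes at all:

-- Compare total updates with heap-only (HOT) updates for the table
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'table_name';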
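
Regarding point 3, a sketch of the settings mentioned there. All of them trade durability for speed, so they are only reasonable if the data can be rebuilt from its sources:

-- Stop writing WAL for this table (PostgreSQL 9.5+); it is truncated
-- after a crash, and the ALTER itself rewrites the table
ALTER TABLE table_name SET UNLOGGED;

-- Per session: COMMIT returns before the WAL is flushed to disk
SET synchronous_commit = off;

-- Cluster-wide and dangerous: a crash can corrupt the whole cluster
ALTER SYSTEM SET fsync = off;
SELECT pg_reload_conf();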