Postgresql – Postgres fast update on non-indexed column

Tags: delete, performance, postgresql, postgresql-performance

Step 1

We have a delete query of the form that we are trying to speed up:

DELETE FROM table_name 
WHERE col_name IN ('a','b',....'zzzz');

The operation deletes between 0.5% and 50% of the table's rows. col_name is an indexed (non-unique) column.

This ran extremely slowly because each delete affected the index.
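
A quick way to confirm where the time goes is to look at the actual plan. A minimal sketch, assuming it is run inside a transaction you can roll back (EXPLAIN ANALYZE really executes the DELETE; the key list here is a shortened placeholder):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM table_name
WHERE col_name IN ('a','b','zzzz');  -- shortened placeholder list
ROLLBACK;  -- undo the delete performed by EXPLAIN ANALYZE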

Step 2

We added a non-indexed boolean tombstone column called deleted, with DEFAULT FALSE. Our query now became:

UPDATE table_name 
SET deleted = TRUE 
WHERE col_name IN ('a','b',....'zzzz');

This definitely runs quicker (by 60-200%), but seems to ignore the col_name index for large IN clauses. However, since the update only touches an unindexed column, it remains fast.
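
For completeness, the schema change implied here would look something like the sketch below (the column name and default are taken from the question; note that on PostgreSQL versions before 11, adding a column with a default rewrites the whole table):

-- Tombstone column; deliberately left unindexed so updates to it
-- do not have to touch any index
ALTER TABLE table_name
    ADD COLUMN deleted boolean DEFAULT FALSE;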

Step 3

We rewrote the condition as:

UPDATE table_name 
SET deleted = TRUE 
WHERE col_name = 'a' 
    OR col_name = 'b' 
    OR ... 
    OR col_name = 'zzzz';

Even though this utilizes the index, it runs at about the same speed as the DELETE from Step 1.

Is there a fast way to delete (or mark as deleted) a number of rows based on membership within a very large IN clause?

The database needs no concurrency handling as it is accessed by a dedicated single-threaded application.

Note: performing the deletes/updates individually was an order of magnitude slower. The IN clause generally has between 20,000 and 5 million elements.
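
As a point of reference only: with key lists in the millions, the same membership test is often expressed by staging the keys in a temporary table rather than an IN list. A sketch using the question's table and column names (the file path and the text type of col_name are assumptions; this changes only the formulation, not the index-versus-sequential-scan decision discussed in the answer below):

-- Stage the keys; \copy from the client is an alternative to server-side COPY
CREATE TEMP TABLE keys_to_delete (col_name text PRIMARY KEY);
COPY keys_to_delete FROM '/path/to/keys.csv';  -- hypothetical path

-- Delete (or UPDATE ... SET deleted = TRUE) by joining against the staged keys
DELETE FROM table_name t
USING keys_to_delete k
WHERE t.col_name = k.col_name;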

Best Answer

I'm afraid there isn't any significant improvement available for your scenario.

A few things to consider:

  1. If your data statistics are good enough (i.e. autovacuum has had the opportunity to properly ANALYZE your data), it is likely that PostgreSQL will use an Index Scan when you DELETE at the low end of the 0.5-50% range, and a Sequential Scan at the high end. It is a matter of estimated cost, and the threshold where it switches from one method to the other depends on cost parameters you can actually tune (see the first sketch after this list).
    It isn't likely that there is any faster way: in any case, the database needs to locate the rows to delete (using an index or not) and delete them. For PostgreSQL, deleting a row amounts to changing the value of a hidden system column (xmax) and, later on, reclaiming the unused space through (auto)VACUUM. Other databases (especially non-transactional ones) perform this operation in significantly different ways; normally you trade higher speed for much lower safety.

  2. The way PostgreSQL's Multi-Version Concurrency Control (MVCC) works, an UPDATE is basically equivalent to a DELETE followed by an INSERT. If you don't touch any indexed column, the UPDATE can take place as a Heap-Only Tuple (HOT) update (you can check how often that happens with the statistics query after this list). This won't help much when you update a huge percentage of your table, because it is unlikely that there is enough free room in your pages. In practice: don't expect an UPDATE to be faster than a DELETE. Your soft deletes aren't likely to be faster than hard deletes (as you have already been able to verify).

  3. If your data is not critical to you (i.e. you use the database to process information that is also stored somewhere else, so losing it would not be critical because you could retrieve it again from its sources), you can use unlogged tables. You could also set aggressive values for some database parameters: synchronous_commit (= off) and fsync (= off); a sketch of these settings follows the list. I would discourage you from doing so unless your need for speed is critical and the data can indeed be easily recreated.
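
Regarding point 1, a sketch of how you can watch the planner switch between an index scan and a sequential scan by adjusting one of those cost parameters (random_page_cost is just an example knob, and 1.1 an example value for fast storage; the key list is a shortened placeholder):

EXPLAIN
DELETE FROM table_name
WHERE col_name IN ('a','b','zzzz');

-- Make random I/O look cheaper; session-local, so easy to experiment with
SET random_page_cost = 1.1;

EXPLAIN
DELETE FROM table_name
WHERE col_name IN ('a','b','zzzz');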
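
Regarding point 2, the statistics views record how many updates took the HOT path, so you can check whether the soft-delete UPDATE avoided the indexes at all:

-- Compare total updates with heap-only (HOT) updates for the table
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'table_name';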
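
Regarding point 3, a sketch of the settings mentioned there. All of them trade durability for speed, so they are only reasonable if the data can be rebuilt from its sources:

-- Stop writing WAL for this table (PostgreSQL 9.5+); it is truncated
-- after a crash, and the ALTER itself rewrites the table
ALTER TABLE table_name SET UNLOGGED;

-- Per session: COMMIT returns before the WAL is flushed to disk
SET synchronous_commit = off;

-- Cluster-wide and dangerous: a crash can corrupt the whole cluster
ALTER SYSTEM SET fsync = off;
SELECT pg_reload_conf();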