Postgresql – is it “bad” to bulk update a boolean column semi-regularly on potentially a large set of records in Postgres

postgresql

pertaining to a previous question I asked: how do you prevent dead rows from hanging around in postgresql?

If I have a table with 300k rows, and I were to update all rows in the table, setting a boolean column from false to true, is this an OK thing to do or should bulk updates over a certain size always be avoided? I'm still a bit fuzzy on the whole Vacuum thing here, my understanding is that the MVCC in postgres causes every row to be duplicated temporarily and then auto-vacuum should come in an reclaim that space. I am wondering, hypothetically, if these 300k rows were to have that boolean value flipped once a week, would I eventually run out of disk space, or would this not be a problem as long as there was enough time between the updates for auto-vacuum to come and keep things in check?

Best Answer

So, 300k rows total doesn't seem like a huge amount, I wouldn't be overly worried unless you have a particular cause for concern (e.g. your UPDATE taking way too long, holding row locks for too long, etc).

But two suggestions which may be helpful for your particular use-case:

First, make sure that your UPDATE statement does not touch rows it does not need to. If you want to set all values of some_bool_column to false, do it like this:

UPDATE mytable SET some_bool_column = false WHERE some_bool_column IS DISTINCT FROM false;

although semantically the UPDATE above is exactly the same as if it did not have a WHERE clause (ignoring triggers), you will avoid a ton of I/O for tuples which already have some_bool_column = false.

Second, if you can, try to take advantage of Heap-Only Tuples aka the HOT optimization for your UPDATEs. So if you can avoid having an index on some_bool_column, these bulk UPDATEs will be faster and may be able to avoid contributing to index bloat. Often, you don't want to have an index on a boolean column anyway, since the selectivity will be low (or if one of the boolean values is rare, just use a partial index on that value.)