PostgreSQL – Rewriting Entire Table

postgresql

We have a database that is just over 200 gb in size, with around 430 million rows, a primary key and a spatial index. Every few weeks we get an update which involves around 3 million rows, deleting some old features and updating new ones. This is done offline.
We also have a history table, which stores the accumulated deletions from each update. This table is currently around 20 gb and 40 has million rows.
Essentially, we first insert into the history table anyting that is marked as to delete in the change table, then delete everything from the main table, that is either to delete or new, and then insert into the new table all the changed records (all offline, to be clear).

If I run this in a single procedure/function a table rewrite seems to be triggered, as I have come back after a few hours and found it failed due to using up the available disk space, around 100 gb. If I run it line by line in a psql console, no such problem. As there are no DDL statements and the total number of changed rows is vastly smaller than the total database size, I am puzzled as to what might be causing this amount of temporary disk usage, as the total space required to record the entire change as one transaction would seem to be much less.

Is there some rule of thumb for this or a simple diagnostic query that I can run to see what is happening?

Best Answer

I'd say the difference is that your PL/PgSQL procedure runs in a single transaction.

If you run line by line in psql, unless you explicitly BEGIN and COMMIT, you're running in individual transactions. This can be quite a lot slower, but it also means that autoVACUUM can come along to free and reuse deleted rows, so subsequent UPDATEs can write new row versions into those freed spaces.

If you're in a single transaction, the system has to keep the old row versions around because it needs them in case you ROLLBACK the transaction or hit an error.

So you can't really do what you want in PL/pgSQL alone, unless you're willing to use hacks like dblink. You need to batch your work into a series of smaller transactions and since PostgreSQL doesn't yet support autonomous commit from within a PL/PgSQL function, that means dblink or an external client.