Short version
Your algorithm looks O(n*m) at first glance, but it effectively grows as O(n*m^2) (m rows inserted by the outer loop, n updates per inner loop), because all rows end up with the same ID. Instead of your expected 5M updates, you are getting over 1.25G row versions.
Long version
Your function runs inside an implicit transaction. That's why you see no data after cancelling your query, and also why PostgreSQL has to keep a distinct version of every updated/inserted tuple for both loops.
Additionally, I suspect you either have a bug in your logic or are underestimating the number of updates being made.
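To make the walkthrough below concrete, here is a minimal sketch of the kind of loop that would produce exactly this behavior. Your actual function isn't shown, so the table name, column and loop bounds are assumptions; the key point is that the inner UPDATE matches on current_id, and because every existing row gets pushed to the same ID, each UPDATE ends up touching every row inserted so far:

    -- Hypothetical reconstruction, not your actual code: the table "t",
    -- the column "id" and the bounds (500 x 10000) are assumptions.
    CREATE TABLE t (id int);

    DO $$
    DECLARE
        current_id int := 1;
    BEGIN
        FOR i IN 1..500 LOOP                  -- outer loop: insert one row
            INSERT INTO t (id) VALUES (current_id);
            FOR j IN 1..10000 LOOP            -- inner loop: 10000 "updates"
                -- matches every row that has reached current_id,
                -- which soon means every row in the table
                UPDATE t SET id = id + 1 WHERE id = current_id;
                current_id := current_id + 1;
            END LOOP;
        END LOOP;
    END $$;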
First iteration of the outer loop - current_id starts at 1, one row is inserted, then the inner loop updates that same row 10000 times, ending with the only row carrying an ID of 10001 and current_id at 10001. All 10001 versions of the row are still kept, because the transaction has not finished.
Second iteration of the outer loop - since current_id is now 10001, a new row is inserted with ID 10001. Now you have 2 rows with the same "ID", and 10002 tuple versions in total (10001 of the first row, 1 of the second). Then the inner loop updates BOTH rows 10000 times, creating 20000 new versions, reaching 30002 tuple versions so far...
Third iteration of the outer loop - current_id is 20001, a new row is inserted with ID 20001. You now have 3 rows, all with the same "ID" 20001, and 30003 tuple versions so far. The inner loop then performs 10000 updates of 3 rows, creating 30000 new versions, for 60003...
...
(Had your disk space allowed it) the 500th iteration of the outer loop would create 5M new versions (10000 updates of 500 rows) in that iteration alone.
As you see, instead of your expected 5M updates, you got 10000 + 20000 + 30000 + ... + 4990000 + 5000000 updates (plus change), which is 10000 * (1+2+3+...+499+500), over 1.25G updates. And of course a row is not just the size of your int, it needs some additional structure, so your table and index grow to over ten gigabytes in size.
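If you want to double-check that figure, the arithmetic is easy to reproduce in SQL (this query only computes the sum, it doesn't touch any table):

    -- 10000 updates per inner loop, multiplied by the row count 1..500
    -- accumulated by the outer loop: 10000 * 125250 = 1,252,500,000
    SELECT 10000 * sum(i) AS total_updates
    FROM generate_series(1, 500) AS i;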
Related Q & A:
Of course, there are actually many ways for a SELECT to take disk space. Just look at the docs for work_mem:
work_mem (integer)
Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The value defaults to four megabytes (4MB). Note that for a complex query, several sort or hash operations might be running in parallel; each operation will be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.
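If sorts or hash tables are what is spilling to disk for you, raising the setting per session is the least intrusive option; the 256MB below is only an illustration, size it to your RAM and expected concurrency:

    -- session-local override; every sort/hash node in a query may use
    -- up to this much memory before it starts writing temporary files
    SET work_mem = '256MB';
    SHOW work_mem;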
Even with a high work_mem, a SELECT can chew away disk space. You can, for instance, have the query call a function that writes to disk directly.
You may be able to mitigate this by using temp_file_limit:
temp_file_limit (integer)
Specifies the maximum amount of disk space that a process can use for temporary files, such as sort and hash temporary files, or the storage file for a held cursor. A transaction attempting to exceed this limit will be canceled. The value is specified in kilobytes, and -1 (the default) means no limit. Only superusers can change this setting.
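For example, as a superuser (the 10GB figure is arbitrary; put it in postgresql.conf instead if you want it server-wide):

    -- cap temporary-file usage for this session; a query that exceeds
    -- the limit is cancelled instead of filling up the disk
    SET temp_file_limit = '10GB';
    SHOW temp_file_limit;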
Best Answer
I'd say the difference is that your PL/PgSQL procedure runs in a single transaction.
If you run line by line in psql, unless you explicitly BEGIN and COMMIT, you're running in individual transactions. This can be quite a lot slower, but it also means that autovacuum can come along to free and reuse deleted rows, so subsequent UPDATEs can write new row versions into those freed spaces.

If you're in a single transaction, the system has to keep the old row versions around because it needs them in case you ROLLBACK the transaction or hit an error.

So you can't really do what you want in PL/pgSQL alone, unless you're willing to use hacks like dblink. You need to batch your work into a series of smaller transactions, and since PostgreSQL doesn't yet support autonomous commit from within a PL/pgSQL function, that means dblink or an external client.
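As a rough illustration of the dblink hack: each dblink_exec call opens its own connection, so whatever it runs commits independently of the calling transaction. The connection string, table and batch condition below are placeholders, not your schema, and your authentication setup may require a user/password in the connection string:

    CREATE EXTENSION IF NOT EXISTS dblink;

    -- runs the UPDATE over a separate connection, i.e. in a separate,
    -- immediately committed transaction; dead row versions from earlier
    -- batches can then be vacuumed away while later batches run
    SELECT dblink_exec(
        'dbname=' || current_database(),
        'UPDATE t SET id = id + 1 WHERE id = 42'   -- placeholder batch
    );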