PostgreSQL – How to SELECT and TRUNCATE Without Losing New Data

postgresql, transaction, truncate

I have an application that writes raw log data to a table called new in a PostgreSQL database — lots of rows and lots of duplicate data. A cron job running every minute normalizes the raw data, inserts the normalized results into a table called summary, and then truncates the new table.

The normalization is done entirely in SQL, but I realize the statement may take some time to execute, during which additional rows may be inserted into the new table. This leads me to believe that data inserted while the statement is running may get lost when the table is truncated.
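For illustration, the cron job currently boils down to something like this (the summary columns and the aggregation are made up; the real query is more involved):

INSERT INTO summary (source, hits)
SELECT source, count(*)
FROM new
GROUP BY source;

TRUNCATE new; -- rows inserted into new after the INSERT began are lost here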

Not knowing exactly how PostgreSQL (or any other SQL-compliant database) handles access to tables while statements are executing, I'm not sure how to guarantee that no data is lost. Is there a way to "move" all the data in the new table into a temporary table, perform the normalization there, and leave the new table empty to receive incoming rows (which will likely arrive while the temporary table is being processed)? Or are there better approaches to this problem, perhaps using transaction blocks?

---- EDIT ----

It never fails… every time I write up a question on Stack Exchange, it gets me to think about the question in different ways and thus leads me to new terms to search for on Google. Happened this time too. As a result, is this answer a potential solution for what I'm wanting to do here? Is it the best solution, given it's now quite dated?

Best Answer

There is more than one possibility here.

One is what Peter Eisentraut suggested in the post you linked: a writable CTE that deletes the rows from the table and, in the same query, processes them or inserts them elsewhere.

This approach may be viable for small datasets. However, some table bloat is likely to accumulate over time, and concurrent inserts and deletes can contend with each other at the disk I/O level.
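For reference, that writable-CTE pattern looks roughly like this; a minimal sketch, with hypothetical source and hits columns:

WITH moved AS (
    DELETE FROM new
    RETURNING *      -- the deleted rows feed the INSERT below
)
INSERT INTO summary (source, hits)
SELECT source, count(*)
FROM moved
GROUP BY source;

Rows inserted by concurrent transactions are not visible to the DELETE, so they simply stay in new and are picked up by the next run.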

Another way to solve this is with transaction blocks, as you also suggested. TRUNCATE needs an exclusive lock, so if there are open transactions on new (the ones doing the inserts, for example), the command will wait until they are closed and then remove everything, including rows that arrived after your normalization query read the table. This means that the problem you anticipate really exists :)

Consider the following:

BEGIN;
LOCK new; -- takes an ACCESS EXCLUSIVE lock by default
INSERT INTO staging_table SELECT * FROM new;
TRUNCATE new;
COMMIT;

In this case, the exclusive lock makes all concurrent INSERT statements wait until the COMMIT. Since the INSERT in this block should be relatively fast, the wait won't be terribly long, and your existing data is safely moved to another table. Then comes the TRUNCATE, and finally the COMMIT releases the lock so the waiting inserts can proceed.

This way you can move the data processing from new to the newly introduced table.
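From there, the normalization can run at its own pace with no lock held on new; a sketch using the same hypothetical columns, assuming staging_table is written by this job only:

INSERT INTO summary (source, hits)
SELECT source, count(*)
FROM staging_table
GROUP BY source;

TRUNCATE staging_table; -- safe here: nothing else writes to staging_table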