PostgreSQL Performance – Efficient Data Insertion into Archiving Table

Tags: performance, postgresql, postgresql-performance

For example, let's say we have table T:

create table if not exists T(
column1 int,
column2 varchar,
column3 date
);

and archiving table TArchive:

create table if not exists TArchive(
column1 int,
column2 varchar,
column3 date
);

What would be the best approach to inserting data older than date x into TArchive without locking table T in production, assuming that table T has a large number of rows?

I've been researching this for hours. For SQL Server there are several documented approaches, for example: https://www.brentozar.com/archive/2018/04/how-to-delete-just-some-rows-from-a-really-big-table/
But for PostgreSQL I can barely find anything.

Should you just take the data directly from table T and insert it into TArchive?

Should you copy the data into a temp table first and then move it into the archive table? And if so, why would that be better when it means doing twice the inserts for the same data?

How many functions should you make? One function to rule them all? Or a function for archiving and another for deleting the old data?

Are there any other approaches?

Best Answer

You probably cannot find any articles on the topic because it is so simple:

WITH old_rows AS (
   DELETE FROM t
   WHERE ...
   RETURNING *
)
INSERT INTO tarchive
SELECT * FROM old_rows;

This won't lock table t very much — concurrent SELECTs and data modifications won't be affected (unless they are trying to modify one of the deleted rows).
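
For the tables from the question, the same statement could look like the sketch below. The use of column3 as the age criterion and the cutoff date are assumptions for illustration, not something the answer specifies:

WITH old_rows AS (
   DELETE FROM t
   WHERE column3 < date '2020-01-01'   -- hypothetical cutoff for "old" rows
   RETURNING *
)
INSERT INTO tarchive
SELECT * FROM old_rows;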

Your main problems are:

  • This statement may take a long time, holding locks and blocking autovacuum as it runs; doing the work in batches (see the sketch after this list) keeps each individual statement short.

  • Afterwards, the table will not be smaller, but emptier. This hurts sequential scans and wastes cache space.
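
If the first point is a concern, the archiving can be split into batches so that no single statement runs for long. The following is only a sketch: it assumes the question's schema, uses a hypothetical cutoff on column3 and an arbitrary batch size, and would be run repeatedly until it moves no more rows.

WITH batch AS (
   SELECT ctid
   FROM t
   WHERE column3 < date '2020-01-01'   -- hypothetical cutoff
   LIMIT 10000                         -- arbitrary batch size
),
old_rows AS (
   DELETE FROM t
   WHERE ctid IN (SELECT ctid FROM batch)
   RETURNING *
)
INSERT INTO tarchive
SELECT * FROM old_rows;

Since each run is its own transaction, locks are held only briefly and autovacuum gets a chance to work between batches.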

The second problem is particularly painful, and the remedies all require making the table unavailable for a while:

  • VACUUM (FULL) will reorganize the table, but blocks any access to the table while it runs.

  • Replacing the table with a different one will block data modifications for a while:

    BEGIN;
    -- SHARE ROW EXCLUSIVE blocks concurrent writes but still allows plain SELECTs
    LOCK TABLE bloated IN SHARE ROW EXCLUSIVE MODE;
    -- copy the table definition, including indexes, defaults and constraints
    CREATE TABLE bloated_copy (LIKE bloated INCLUDING ALL);
    -- copy the live rows, then swap the tables
    INSERT INTO bloated_copy SELECT * FROM bloated;
    DROP TABLE bloated;
    ALTER TABLE bloated_copy RENAME TO bloated;
    COMMIT;
    

Because of all that, it is a good idea to plan how you want to get rid of old data right when you design your system, but in my experience this is almost always forgotten. The least painful way to get rid of old data is to partition the tables by time.
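
As an illustration only (none of this is in the answer above), a range-partitioned version of the question's table could look like the sketch below; the partition names and date ranges are made up. Retiring old data then becomes a cheap metadata operation instead of a bulk DELETE:

create table if not exists T(
    column1 int,
    column2 varchar,
    column3 date
) partition by range (column3);

create table T_2023 partition of T
    for values from ('2023-01-01') to ('2024-01-01');
create table T_2024 partition of T
    for values from ('2024-01-01') to ('2025-01-01');

-- detach (or simply drop) the partition holding the expired data;
-- the detached table can serve as the archive or be dumped elsewhere
alter table T detach partition T_2023;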