PostgreSQL Performance – Efficient Data Insertion into Archiving Table

performancepostgresqlpostgresql-performance

For example lets say we have table A:

create table if not exists T(
column1 int,
column2 varchar,
column3 date
);

and archiving table TArchive:

create table if not exists TArchive(
column1 int,
column2 varchar,
column3 date
);

What would be the best approach to inserting data older than x date into TArchive without locking down table T in production? Under the assumption that table T has a big amount of rows.

I've been researching this for hours. In SQL Server you have different approaches like: https://www.brentozar.com/archive/2018/04/how-to-delete-just-some-rows-from-a-really-big-table/
But in PostgreSQL I can barely find anything.

Should you just directly take the data from the T table and import it into TArchive?

Should you import your data into a temp table first, then import that into the archive table? And if so, why would this approach be better when you're doing 2x the inserts for the same data.

How many functions should you make? One function to rule them all? Or a function for archiving and another for deleting the old data?

Are there any other approaches?

Best Answer

You probably cannot find any articles on the topic because it is so simple:

WITH old_rows AS (
   DELETE FROM t
   WHERE ...
   RETURNING *
)
INSERT INTO tarchive
SELECT * FROM old_rows;

This won't lock table t very much — concurrent SELECTs and data modifications won't be affected (unless they are trying to modify one of the deleted rows).

Your main problems are:

This statement may take a long time, holding locks and blocking autovacuum as it runs.
Afterwards, the table will not be smaller, but emptier. This hurts sequential scans and wastes cache space.

Particularly the second problem hurts, and the remedies all require making the table unavailable for a time:

VACUUM (FULL) will reorganize the table, but blocks any access to the table while it runs.

Replacing the table with a different one will block data modifications for a while:

BEGIN;
LOCK TABLE bloated IN SHARE ROW EXCLUSIVE MODE;
CREATE TABLE bloated_copy (LIKE bloated INCLUDING ALL);
INSERT INTO bloated_copy SELECT * FROM bloated;
DROP TABLE bloated;
ALTER TABLE bloated_copy RENAME TO bloated;
COMMIT;

Because of all that, it is a good idea to plan how you want to get rid of old data right when you design your system, but in my experience this is almost always forgotten. The least painful way to get rid of old data is to partition the tables by time.

`TRUNCATE` is fastest

Generally, if large parts of the table are changed, a TRUNCATE / INSERT from function is probably the fastest way. If done in the same transaction, Postgres does not need to write WAL (since we start from scratch anyway). Also, you get a pristine table without bloat, which reflects positively on the next iteration of this process .. For big tables drop and recreate indexes etc. Details in this related answer:

Best way to populate a new column in a large table?

The rest only applies if you want to keep existing rows in place for some reason.

If triggers are in the way (as you write and as I am not convinced, but let's assume for now). Or if you have additional columns in the table that cannot be lost.

Depending on how many rows are in the change set (returned from the function) ...

Few rows

Less than ~ 1000, it depends on many factors. A data-modifying CTE (with an automatic, cheap, internal temp table for the function results) is probably fastest:

WITH x AS (SELECT * FROM calculate_users_info())
, del AS (
   DELETE FROM user_info t
   WHERE  NOT EXISTS (
      SELECT 1 FROM x
      WHERE  user_id = t.user_id
      AND    info_id = t.info_id
      )
, upd AS (
    UPDATE user_info t 
    SET   (info1,   info2,   info3,   info4,   info5)
      = (x.info1, x.info2, x.info3, x.info4, x.info5)
    FROM   x
    WHERE  x.user_id = t.user_id
    AND    x.info_id = t.info_id
    AND   (x.info1 <> t.info1 OR 
             x.info2 <> t.info2 OR
             x.info3 <> t.info3 OR
             x.info4 <> t.info4 OR
             x.info5 <> t.info5)
    )
INSERT INTO user_info
      (user_id, info_id, info1, info2, info3, info4, info5) 
SELECT user_id, info_id, info1, info2, info3, info4, info5
FROM   x
WHERE  NOT EXISTS (
    SELECT 1 
    FROM   user_info t3
    WHERE  t3.user_id = t2.user_id 
    AND    t3.info_id = t2.info_id
    )
;

Many rows

In this case, your script building on a temporary table looks mostly fine. The major advantage is the index on the temp table - of which you did not properly take advantage, due to missing statistics.

I also have a couple of other suggestions:

CREATE TEMP TABLE user_info_tmp ON COMMIT DROP AS  -- directly from SELECT
SELECT * FROM calculate_users_info();

CREATE INDEX user_info_tmp_idx ON user_info_tmp (user_id, info_id);

ANALYZE user_info_tmp;     -- !!!

DELETE FROM user_info t    -- with EXISTS semi-anti-join
WHERE  NOT EXISTS (
   SELECT 1 FROM user_info_tmp
   WHERE  user_id = t.user_id
   AND    info_id = t.info_id
   );

ANALYZE user_info;         -- only if large parts have been removed

UPDATE user_info t         -- with short syntax
SET   (info1,   info2,   info3,   info4,   info5)
  = (x.info1, x.info2, x.info3, x.info4, x.info5)  -- shorter, not faster
FROM   user_info_tmp x 
WHERE  x.user_id = t.user_id
AND    x.info_id = t.info_id
AND   (x.info1 <> t.info1 OR x.info2 <> t.info2 OR x.info3 <> t.info3
    OR x.info4 <> t.info4 OR x.info5 <> t.info5);

INSERT INTO user_info      -- with join syntax
      (user_id, info_id, info1, info2, info3, info4, info5) 
SELECT user_id, info_id, info1, info2, info3, info4, info5
FROM   user_info_tmp x
LEFT   JOIN user_info u USING (user_id, info_id)
WHERE  u.user_id IS NULL;  -- shorter, maybe faster

Major points

Temp tables are not analyzed automatically. Also, tables created in the same transaction and used immediately generally don't give autovacuum a chance to kick in. Details:
Are regular VACUUM ANALYZE still recommended under 9.1?

For both reasons you need to run ANALYZE on the table manually, best in the time line where I put it. This avoids grossly misguided query plans. Minor additional optimization: you might set the statistics target for irrelevant columns to 0 - all except (user_id, info_id) in the example.
Don't use GLOBAL for temp table. Per documentation:

GLOBAL or LOCAL

Ignored for compatibility. Use of these keywords is deprecated; refer to CREATE TABLE for details.
You can create the temp table from the function result automatically. Much shorter code and a bit faster, too.
Consider the shorter syntax variants, which mostly shorten the code and won't change performance much.

`work_mem` / `temp_bufers`

Either way, you need enough RAM to make it fast. Hardly relevant for small cardinalities, but crucial for big tables. A couple of thousands should not touch any memory limits in a halfway decent setup. For more, be sure to allocate enough work_mem for the CTE or temp_bufers for the temp table. Related answer:

Optimizing bulk update performance in PostgreSQL

Best Answer

Related Solutions

PostgreSQL import CSV File causes Syntax Error

PostgreSQL 9.x – Efficient Way to Insert/Update/Delete Table Records from Complex Query

TRUNCATE is fastest

Few rows

Many rows

Major points

work_mem / temp_bufers

Related Question

`TRUNCATE` is fastest

`work_mem` / `temp_bufers`