PostgreSQL – Database Archive Solutions

Tags: archive, partitioning, postgresql, postgresql-9.1

Following up on a question I posted, Is it a good idea to move high-volume and highly-accessed tables to a separate database?, I'm looking for the different techniques/solutions available for database archiving in PostgreSQL.

A few solutions I can think of are:

  1. Table partitioning
  2. Separate tablespace and/or schema
  3. Moving archived records/tables to a different hard disk

Any other suggestions/pointers/solutions are really welcome and appreciated.

NOTE: We're running PostgreSQL v9.1.3 on CentOS 5.2

Best Answer

My suggestions for archiving:

  1. Create an archive_tablespace (if you want, you can put the archive on separate hardware)
  2. Create the tables. For example, suppose we want to archive the table posts.
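    Step 1 might look like the following (the directory path here is just an example; use any empty directory owned by the postgres user, ideally on separate disks):

        -- example path; the location is an assumption, adapt it to your setup
        create tablespace archive_tablespace
          location '/mnt/archive/pgdata';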

    create table posts_all (like public.posts);          -- parent: all posts
    create table posts_archive () inherits (posts_all);  -- archived posts
    alter table public.posts inherit posts_all;          -- note: INHERIT, not INHERITS
    

    After that we will have two new tables: public.posts_all (with the same columns as posts), for querying all posts (archive and production), and posts_archive, for querying all archived posts. public.posts will inherit from posts_all.
    Inserts should keep going the old way (into public.posts), unless you write triggers on posts_all to redirect inserts to the posts table. If you have partitioning, it gets more complicated. With a working application, and before migrating old data, you don't have to change anything in the application code for this approach to work.
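    A redirect trigger on posts_all, if you want inserts against the parent to land in public.posts, could look like this (a sketch, not part of the original setup above):

        create or replace function posts_all_insert() returns trigger as $$
        begin
          insert into public.posts values (new.*);  -- redirect the row to production
          return null;                              -- suppress insert into posts_all itself
        end;
        $$ language plpgsql;

        create trigger posts_all_insert
          before insert on posts_all
          for each row execute procedure posts_all_insert();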

  3. Create a schema archive for logical separation. My suggestion is to separate archive data by some time period (year or month) if possible (e.g. archive_2005).

  4. Create the archive tables in each archive_<year> schema

    create table archive_2005.posts (
      check (record_date >= '2005-01-01 00:00:00'::timestamp
         and record_date <  '2006-01-01 00:00:00'::timestamp)
    ) inherits (posts_archive) tablespace archive_tablespace;
    

    After that you will have a new table posts in schema archive_2005, and the PostgreSQL planner will know that its data covers only the designated time period. If you query a different time period, PostgreSQL will not scan this table (provided constraint_exclusion is enabled, which is the default on 9.1).
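    You can verify that the planner prunes the archive table with EXPLAIN; a query outside the CHECK range should not touch archive_2005.posts:

        set constraint_exclusion = partition;  -- 9.1 default; 'on' also works
        explain select * from posts_all
        where record_date >= '2010-01-01'::timestamp;
        -- the plan should contain no scan of archive_2005.posts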

  5. Create functions/procedures/triggers to move data to archive tables.
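    One way to implement step 5 on 9.1 is a writable CTE that deletes from production and inserts into the archive in a single statement (a sketch; it assumes the record_date column from the CHECK constraint above):

        begin;
        with moved as (
          delete from public.posts
          where record_date < '2006-01-01'::timestamp  -- rows older than the cutoff
          returning *
        )
        insert into archive_2005.posts
        select * from moved;
        commit;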

  6. Do the archiving once per time period (yearly here) and vacuum the old table, or do it automatically with triggers (heavier load on autovacuum). Both techniques have advantages and disadvantages.

Once implemented:

  1. You can query archive (select * from posts_archive), all (select * from posts_all), and production (select * from public.posts) data separately.
  2. You can dump archive schemas separately (pg_dump -n archive_2005 database_name) and drop them easily (drop schema archive_2005 cascade; --be careful, this removes all tables in the schema).
  3. Old data is separated physically by tablespace and logically by schema.
  4. The structure is quite complicated, which makes the archiving process harder to manage.
  5. You can create different indexes on production and archive tables to optimize queries against each (smaller, specialized indexes = faster queries and less space required).
  6. If you have tables partitioned by year or month, the archive process is just moving the whole table to archive_tablespace, or altering it to inherit from posts_archive instead (I didn't test this).
  7. If you don't need to access old (archived) data, you don't have to change anything in the application.
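The partition-migration path in point 6 might look like this (untested, as noted above; posts_2005 is a hypothetical yearly partition of posts):

    alter table posts_2005 no inherit public.posts;          -- detach from production
    alter table posts_2005 inherit posts_archive;            -- attach to the archive tree
    alter table posts_2005 set tablespace archive_tablespace; -- rewrites the table on the archive disk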

This is a general technique that you should adapt to your needs. Any suggestions to improve it?

Further reading: PostgreSQL inheritance, partitioning