PostgreSQL – How to Perform Live Backup

backupindex-tuningpostgresqlpostgresql-9.3replication

Goal: Maintaining a constant live backup of a postgres cluster. In case of cluster failure, backup/Secondary will contain most up to date view of Primary cluster. This is not a hot standby circumstance. When the Primary postgres machine can be fixed/switched, etc. the new machine's Postgres cluster will be restored using the backup.

Current direction: Using postgres 9.3 streaming replication to a remote postgres server. (Currently unsure whether it will be synchronous or asynchronous, depending on consistency requirements of the system.)

Question: Is this a reasonable solution?

Given this solution, is there any way to tune the Secondary/Slave postgres cluster? For instance, since the backup server won't be responding to queries it doesn't need indices. Can I somehow disable these for the Secondary cluster?

In what ways can I save in disk space and memory on my Secondary/Slave cluster given the knowledge that it's only serving as a 'dumb' backup?

Best Answer

Given this solution, is there any way to tune the Secondary/Slave postgres cluster? For instance, since the backup server won't be responding to queries it doesn't need indices. Can I somehow disable these for the Secondary cluster?

In what ways can I save in disk space and memory on my Secondary/Slave cluster given the knowledge that it's only serving as a 'dumb' backup?

If you do only logical backups (pg_dump), you'll save lots of space - but you'll have much larger data loss windows.

Currently your only other options are WAL archiving and streaming replication, and you're right that they have some big overheads.

You cannot exclude indexes or unwanted tables from the replica stream. Nor can you select only some databases to replicate. It's an all-or-nothing prospect. The only exception is unlogged tables - their contents don't get replicated, and neither do contents of their indexes. So you can use unlogged tables for transient data on the primary to save some I/O, WAL space, and disk space.

There are some logical streaming systems like Londiste and Slony-I, but they're only asynchronous. So you can have a potentially unbounded data loss window if the replica isn't keeping up with the primary.

There's ongoing work into adding synchronous streaming logical replication to PostgreSQL via the BDR project, but it won't be in 9.4 and might not be in 9.5 either, so it's a long-term thing.

Thankfully, a physical replica server has quite low CPU overheads and its disk I/O overhead is usually only moderate. So long as you throw enough spare disk space it it, it won't unduly burden your other application. There isn't really much you can do to tune it, though spreading out checkpoints longer on master can help reduce writes on the replica.

One option you may wish to consider is to sacrifice recovery time for lower overheads. Use WAL archiving with an archive_timeout on the master and monitor log shipping closely. Set up alerts to warn you urgently if logs stop arriving at the destination storage. Then just do a pg_basebackup and use WAL archiving to keep an asynchronous replica. It'll take a lot longer to start up if you ever need to restore it and it needs a ton of disk space, but it has near-zero CPU and RAM overhead and does very efficient sequential disk writes that the I/O subsystem barely notices. You can also have it write to any storage you feel like, not necessarily the spare server. The key thing with this is that you must test restores carefully - make sure you can deal with the long recovery times (and measure them for your data), ensure your base backup procedures are correct, rotate your base backups regularly, etc. I recommend PgBarman, which automates a lot of this for you. You must understand though that WAL archiving is asynchronous - there's always a data loss window of up to archive_timeout at any time, and it can get a lot bigger if your WAL archiving fails due to (eg) running out of disk space, as the master server will not stop or pause if the replication stops.

So for safety, stick with synchronous streaming replication. If you need to, buy a better server.