PostgreSQL – EC2 – How to correctly back up PostgreSQL data


Here is the setup: one small Amazon Linux (EBS-backed) EC2 instance with three additional volumes. It is both a web server and a database server. One volume is for code, one for the PostgreSQL (8.4) data directory, and one for storing archived WAL files from PostgreSQL.

(1) The volume with WAL files will also hold a base backup of the data directory, copied over after calling pg_start_backup(). It will then store the continuous archive output from PostgreSQL (the WAL files). To snapshot this volume, is there any point in issuing a sync and freezing the filesystem (using xfs_freeze if it's XFS, or dmsetup if it's ext4), or can I just take a live snapshot? WAL files will be shipping at a rate of one per minute. Is it possible that a snapshot could be initiated while a single WAL file is being copied over, resulting in corrupt data?
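(For reference, this kind of archiving setup is driven by a few postgresql.conf settings. A minimal sketch with a hypothetical archive path; the test-and-cp command is the stock example from the PostgreSQL docs:)

    # postgresql.conf -- continuous archiving (archive path is hypothetical)
    archive_mode    = on     # needs a server restart on 8.4
    archive_command = 'test ! -f /mnt/walarchive/%f && cp %p /mnt/walarchive/%f'
    archive_timeout = 60     # force a segment switch at least once a minute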

(2) The volume that contains the live PostgreSQL data directory will also be backed up daily, for good measure. Before snapshotting this volume, I run a pg_dump and keep the resulting SQL file in the data directory. Is there any point in taking precautions to ensure the actual database data is consistent? Would it be correct to assume that taking a live snapshot will properly (a) back up the config files (postgresql.conf, pg_hba.conf, pg_ident.conf) and (b) back up the SQL dump file? Backing up those two things, the SQL dump file and the config files, is the main point of snapshotting this volume. The DB is not very big, so I don't mind that the data files will bloat this snapshot. And in that case, I can just take a live snapshot, correct?

(2a) Or would it be better to keep the data directory on the root volume and have a backup script that copies the SQL dump file and the config files onto another volume, then snapshot that volume once the copy is done?
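(A rough sketch of what that script might look like; the mount points, database name, and file names are placeholders, not details from this setup:)

    #!/bin/sh
    # Hypothetical pre-snapshot backup script: dump the database and copy
    # the config files onto the volume that will be snapshotted.
    set -e
    BACKUP_DIR=/mnt/backup   # assumed mount point of the snapshot volume
    PGDATA=/mnt/pgdata       # assumed PostgreSQL data directory

    pg_dump -U postgres -f "$BACKUP_DIR/mydb.sql" mydb   # assumed db name
    cp "$PGDATA/postgresql.conf" "$PGDATA/pg_hba.conf" \
       "$PGDATA/pg_ident.conf" "$BACKUP_DIR/"
    sync   # flush dirty buffers so everything is on disk before the snapshot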

(3) As for the volume with code on it, again, is there any point in syncing and freezing the filesystem, or can a live snapshot just be taken? This data should be fairly "static".

(4) Is this a solid backup scheme? The root volume is not backed up on a regular basis since I will just keep a machine image after it is set up and configured.

Thanks

Best Answer

See the fine manual. If my advice conflicts with it in any way, the manual is right.

  1. A sync isn't a bad idea, unless your copy tool fsync()s each WAL file it writes, and the directory it's in, before copying the next one. An incomplete last WAL file doesn't matter much; at worst, you just delete it. Pg will generally choke on an incomplete WAL segment and stop recovery there, though you could be really unlucky and have it try to apply garbage data that by sheer insane chance happened to look like real WAL records. In your position I'd sync the volume before a snapshot to make sure any unwritten dirty buffers in RAM hit the filesystem image on disk. A freeze would help avoid messy but non-fatal partially written WALs, so it's not a terrible idea, but it isn't vital. What's vital is having an undamaged timeline up to the point of recovery. Personally, I write my WALs to a temporary file name and rename them to their final name only once they're fully copied; if you do this, you don't need to freeze.
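     A minimal sketch of that copy-then-rename approach as an archive_command script; the archive directory and script path are placeholders:

        #!/bin/sh
        # Hypothetical WAL archive script; wire it up in postgresql.conf as:
        #   archive_command = '/usr/local/bin/archive_wal.sh %p %f'
        # It copies the segment under a temporary name, flushes it to disk,
        # then renames it into place, so a snapshot never sees a half-written
        # file under its final name.
        set -e
        SRC="$1"                         # %p: path to the WAL segment
        FNAME="$2"                       # %f: the segment's file name
        DEST=/mnt/walarchive             # placeholder archive directory

        [ -f "$DEST/$FNAME" ] && exit 1  # never overwrite an existing segment
        cp "$SRC" "$DEST/$FNAME.tmp"
        sync                             # crude but portable flush of the copy
        mv "$DEST/$FNAME.tmp" "$DEST/$FNAME"  # rename is atomic within a filesystem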

  2. Sounds correct. A live snapshot is just like doing a plug-pull test on a live system with write-through caching. Your database should recover fine when restored from a live snapshot, same as after a plug pull. I'd recommend that you automate tests of restores from snapshots. (Note: a snapshot restore test is not a complete substitute for plug-pull testing, because it doesn't account for possible write caching in disks, RAID controllers, etc.) Not only the config files and the dump, but the database itself should be fine after your snapshot. Consider syncing the volume before the snapshot to make sure the dump data etc. has actually hit disk.
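     For example, with the volume ID as a placeholder (this uses the current AWS CLI; the older ec2-api-tools had an equivalent ec2-create-snapshot):

        # Flush dirty buffers, then snapshot the volume live
        sync
        aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
            --description "daily pgdata snapshot"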

    2a. Might save some disk space. Little difference otherwise. You'll get to keep the snapshots a lot longer without all the churn of the live database on them.

  3. Why even snapshot your code volume? A plain file-level copy may well be just fine. Certainly a live snapshot should be.
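     For instance, with a hypothetical mount point and backup destination:

        # Plain file-level copy of the code volume (paths/host are placeholders)
        rsync -a --delete /mnt/code/ backuphost:/backups/code/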

  4. This is not a solid backup scheme. It fails in one critical area: there is no restore testing or validation being performed. You should always test your backups regularly to make sure you can really restore them.

    Personally, I recommend that you use WAL shipping, or send database dumps, to a different host, preferably one not on Amazon EC2, or at least in a different region. This host should perform automated restore tests, send you reports of the results, and should also be checked manually.
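    A skeleton of such an automated restore test, with an assumed dump path, a scratch database, and a made-up sanity-check table:

       #!/bin/sh
       # Hypothetical nightly restore test: load the latest dump into a scratch
       # database and fail loudly if the restore or a sanity query breaks.
       set -e
       DUMP=/backups/mydb.sql                   # assumed dump location
       dropdb restore_test 2>/dev/null || true  # scratch DB from the last run
       createdb restore_test
       psql -v ON_ERROR_STOP=1 -d restore_test -f "$DUMP"
       # Sanity check: a made-up table that should contain rows after restore
       ROWS=$(psql -d restore_test -tAc "SELECT count(*) FROM users;")
       [ "$ROWS" -gt 0 ] || { echo "restore test FAILED" >&2; exit 1; }
       echo "restore test OK: users has $ROWS rows"  # hook mail/alerting here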

    While your snapshots (containing dumps) will be on S3, and will be safe there, that doesn't mean they'll be accessible when you need them urgently. Amazon's durability claims are reassuring, but your data can still be safe and completely inaccessible to you during a badly timed outage of the S3 service.