I think this is normal and expected if your restore_command
is set to something like this example:
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
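For context, a minimal `recovery.conf` for such a standby might look like the sketch below. The hostname, user, and trigger file path are placeholders, not taken from your setup:

```
standby_mode = 'on'
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
trigger_file = '/var/lib/postgresql/failover.trigger'
```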
The manual says that:
At startup, the standby begins by restoring all WAL available in the
archive location, calling restore_command. Once it reaches the end of
WAL available there and restore_command fails, it tries to restore any
WAL available in the pg_xlog directory. If that fails, and streaming
replication has been configured, the standby tries to connect to the
primary server and start streaming WAL from the last valid record
found in archive or pg_xlog. If that fails or streaming replication is
not configured, or if the connection is later disconnected, the
standby goes back to step 1 and tries to restore the file from the
archive again. This loop of retries from the archive, pg_xlog, and via
streaming replication goes on until the server is stopped or failover
is triggered by a trigger file.
So you can expect to see exactly one restore_command failure when you start your standby: PostgreSQL keeps calling it with incrementing WAL segment file names until the copy fails once.
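You can simulate that startup loop with a toy "archive" to see why one failure is expected: the copy succeeds while segments exist, then fails as soon as the archive is exhausted. All paths and segment names below are made up for illustration:

```shell
#!/bin/sh
# Simulate the standby's startup loop against a toy WAL archive.
archive=$(mktemp -d)   # stands in for /mnt/server/archivedir
dest=$(mktemp -d)      # stands in for the standby's pg_xlog

# Pretend two WAL segments were archived before the standby started.
touch "$archive/000000010000000000000001" \
      "$archive/000000010000000000000002"

restored=0
failed=""
# The standby requests segments in order until the copy fails once.
for seg in 000000010000000000000001 \
           000000010000000000000002 \
           000000010000000000000003; do
    if cp "$archive/$seg" "$dest/$seg" 2>/dev/null; then
        echo "restored $seg"
        restored=$((restored + 1))
    else
        echo "restore failed for $seg - end of archive, switching to streaming"
        failed=$seg
        break
    fi
done

rm -rf "$archive" "$dest"
```

After the two archived segments are restored, the third copy fails, which is the single failure you see in the log before streaming takes over.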
Then it will connect to the primary and start streaming as described above, and as you saw in your logs:
LOG: streaming replication successfully connected to primary
The slave is not guaranteed to be exactly up-to-date with the master; for example, it could have been disconnected from the master for a while. In particular, this line:
LOG: consistent recovery state reached at 31/B73624A0
does not mean that "the hot standby contains all the data of the master". However, if you see it followed by this line, as you did:
LOG: database system is ready to accept read only connections
then the database is "ready enough" to start functioning as a read-only standby, as the manual says:
It may take some time for Hot Standby connections to be allowed,
because the server will not accept connections until it has completed
sufficient recovery to provide a consistent state against which
queries can run. During this period, clients that attempt to connect
will be refused with an error message.
In my case, I saw consistent recovery state reached not followed by database system is ready to accept read only connections. This turned out to be a problem with an embedded scripting language plugin (plpython2) that had a system-wide startup script (sitecustomize.py) which did bad things to the PostgreSQL process (enabling faulthandler and installing a signal handler for SIGUSR2), causing it to never enter hot standby mode.
Best Answer
Yes, that advice remains valid.
A low-level snapshot of a volume that takes atomic snapshots is much like a plug-pull or a server crash: when restored from the snapshot, PostgreSQL simply performs normal crash recovery, replaying the transaction logs.
It's a perfectly sensible way to take a backup, though I recommend also taking periodic dumps. Snapshot backups won't help you in the face of undetected filesystem corruption etc.
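A periodic logical dump alongside the snapshots can be as simple as a cron entry. This sketch assumes an `/etc/cron.d`-style file (with a user field); the database name, user, and backup path are placeholders:

```
# Nightly at 02:00: compressed logical dump of "mydb" as user "postgres".
# The % in the date format must be escaped in crontab entries.
0 2 * * * postgres pg_dump --format=custom --file=/backups/mydb-$(date +\%F).dump mydb
```

The custom format (`--format=custom`) is compressed and lets you restore selectively with pg_restore.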