Postgresql – Storage snapshots for consistent backup of postgresql – different data and log volumes

postgresqlstoragevirtualisation

We are running many Linux VM's in a vmware/shared storage environment, each running its own instance of postgreSQL (a mix of 9.0 and 9.3). Currently, the entire VM sits on a single root partition/volume, and we've had great success (~8 years) using storage-based snapshots of the underlying VMFS volumes for backup/restore process (and replication to our DR site).

Due to the architecture of our storage, it would be advantageous to separate postgres WAL files to a non-cached, mostly-write volume to give us less cache churn on the storage side. With our storage (Nimble Storage), we can assign both volumes to a single protection/snapshot group, but I haven't been able to elicit from our vendor that the snapshots will happen at EXACTLY the same time across all volumes in the protection group – it likely will, but there's always that chance that its milliseconds apart.

To that end, we ran some experiments, all while writing data to the DB as fast as possible using pg_bench. After the experiments, we restored our snapshot'ed volumes and started the VM+postgres

  • Snapshot both data and log volumes close to simultaneously – result: DB recovered
  • Snapshot data volume first, log volume ~1 minute later – result: DB recovered
  • Snapshot log volume first, data volume ~1 minute later – result: DB recovered
  • Snapshot log volume first, data volume ~3 minutes later, after a WAL checkpoint wrote new data to datafiles: result: DB recovered

So testing seem to tell us as long as both snapshots are consistent at the volume level, and relatively close together, you get a consistent copy of the DB, based on the time of the WAL/Log volume snapshot.

My question: Is this safe? What are the corner cases we are missing in our testing, and what could go wrong?

Postgres' doc indicates this is not safe, but testing seems to indicate its pretty robust:
http://www.postgresql.org/docs/9.1/static/backup-file.html

If your database is spread across multiple file systems, there might not be any way to obtain exactly-simultaneous frozen snapshots of all the volumes. For example, if your data files and WAL log are on different disks, or if tablespaces are on different file systems, it might not be possible to use snapshot backup because the snapshots must be simultaneous. Read your file system documentation very carefully before trusting the consistent-snapshot technique in such situations.

NOTE: Yes, we know about other options to make sure they are consistent, like putting PostgreSQL into hot backup mode or using our storage's VMware integration to quiesce the VM's themselves, but we are looking for a storage-only solution for speed, convenience, and zero impact to our clients.

Best Answer

The documentation you cited says it all, but I wouldn't blame you if you want to try to verify the claims of the vendor regarding snapshots taken at the same time. Perhaps a way of uncovering something could be to stress test the WAL system more specifically.

e.g. In addition to your pgbench-based tests, try adding random calls to pg_switch_xlog() to force log rotation, shorter and longer checkpoint intervals (shortening and lengthening checkpoint_timeout and checkpoint_timeout) and even using small or large wal file sizes.

Unless there's something I'm missing, for your snapshots not taken at the same time, I would attribute your recovered DBs perhaps to a bit of lucky timing. In the last case, imagine you took your log snapshot while the current xlog location was, say, 0/A1C0FFEE. Then you have 3 minutes of particularly heavy load on the system, that causes a full cycle through the WAL files, and your DB is now at 0/DEADBEEF when the data snapshot is taken. When you attempt to restore, the WAL files being written to at the time of the data snapshot are long gone, and recovery will fail.