PostgreSQL Replication – Can I Do pg_start_backup() on a Live, Running DB Under Load?

postgresqlreplication

Our established replication has broken ("requested WAL segment has already been removed" during downtime)
We cannot easily stop the master again.

Can we do

  1. pg_start_backup(),
  2. rsync ${PGDATA}/ master to slave,
  3. pg_stop_backup()

… while the master postgresql is still under full load?
(Or will pg_start_backup() lead to

  • table locks,
  • I/O blocks,
  • inconsistencies,
  • fire alarm,
  • slow db response

In other words, will pg_start_backup() affect our application?

Best Answer

pg_start_backup will perform a checkpoint, as dezso notes. This does have an impact, but your database performs checkpoints quite regularly anyway, and must do so to function, so they're clearly not a problem for you. An early checkpoint means that less data has been accumulated, meaning that if anything a checkpoint from pg_start_backup will be lower-impact than normal.

Where you need to worry is the rsync or equivalent pg_basebackup step. The read I/O from this won't be too bad since it's sequential, but it'll still probably significantly hurt your database's I/O performance, and it'll also tend to push hot data out of RAM cache in favour of less-used data, causing cache thrashing as the more-needed data is then read back in.

You can use nice and ionice to help limit the I/O impact (but not the cache impact); however, there's a cost to that. The backup will take longer, and until you complete the backup and run pg_stop_backup your system is - as I understand it - accumulating WAL it cannot delete, accumulating checkpoint debt for a BIG checkpoint at the end of the backup run, and is accumulating table and index bloat because it can't clean up dead rows. So you really can't afford to have the backup take forever, especially if you have very high churn tables.

In the end, it's hard to say whether you can safely use pg_start_backup and pg_stop_backup for hot backups in your environment. Most people can, but if you're close to the edge of what your hardware can do, have tight timing requirements, cannot afford the risk of a stall, and have very high churn tables as well as very big tables, it might be troublesome.

Unfortunately, you pretty much need to test it and see.

If you can, it might be worth issuing a CHECKPOINT then taking an atomic snapshot of the volume your database is on instead using LVM, your SAN's tools, EBS, or whatever you're on. If you can do this, you can then copy the snapshot at your leisure. This approach isn't suitable for taking a base backup for PITR/warm standby/hot standby, but it's perfectly good for a static backup copy, and is much lower impact on the system. You can only do this if your snapshots are atomic and your entire database including WAL is on a single volume, though.

One possibility I haven't yet investigated is combining the two approaches. It occurs to me that one could possibly (untested and possibly wrong and unsafe, I don't know yet):

  • pg_start_backup
  • Trigger snapshots of all tablespaces, the main datadir, and the xlog volume
  • pg_stop_backup
  • Copy WAL up to the final archive from pg_stop_backup
  • Copy the data from the snapshotted volumes

Essentially, the idea is to reduce how long the DB has to be delaying its checkpoints by taking a point-in-time of each volume that you can copy at your leisure.