The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages:

http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN
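That severity-versus-logging behaviour is controlled in postgresql.conf. As a rough illustration (the value here is just an example, not a recommendation):

```
# postgresql.conf -- messages at this level or above are written to the log.
# Startup messages are emitted at FATAL, which outranks almost every
# threshold, so they reach the log even with a restrictive setting like this:
log_min_messages = error
```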
After the rsync, did you really run what you show?:

pgsql -c "select pg_stop_backup();";

Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:
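For the record, the working invocation would use psql (connection options such as -h and -U depend on your setup and are omitted here):

```shell
# Finish the base backup on the master; note the executable is psql, not pgsql.
psql -c "SELECT pg_stop_backup();"
```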
Log: consistent recovery state reached at 0/BF0000B0
and:
Log: streaming replication successfully connected to primary
Did you try connecting to the slave at this point? What happened?
The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:
The only ways I have restarted Postgres is through the service
postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands.
After I receive this error, I kill all processes and again try to
restart the database...
Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:
log_line_prefix = '[%m] %p %q<%u %d %r> '
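With that prefix, each line carries a timestamp (%m), PID (%p), and, for client backends, the user, database, and client address (%u %d %r). An illustrative (not real) log line would look like:

```
[2012-06-14 10:15:30.123 UTC] 4711 <postgres mydb 192.168.1.10(54321)> LOG:  streaming replication successfully connected to primary
```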
The recovery.conf file looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?
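For comparison, a typical 9.1 recovery.conf for a streaming standby that can also fall back to a WAL archive looks something like this (host, user, and archive path are placeholders):

```
# recovery.conf on the standby (PostgreSQL 9.1)
standby_mode = 'on'
primary_conninfo = 'host=masterhost port=5432 user=replicator'
# Optional: restore archived WAL when streaming falls behind.
restore_command = 'cp /path/to/archive/%f "%p"'
```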
The message:

requested WAL segment 00000001000018F70000008A has already been removed

means that the master hasn't kept enough history to bring the standby back up to date. Since you are using version 9.1, you can use pg_basebackup to create a new slave. We use a command like:
pg_basebackup -h masterhost -U postgres -D path --progress --verbose -c fast
This doesn't lock the master, and you don't have to rsync or call pg_start_backup() and friends.
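To make the "has already been removed" error less likely to recur, the master can be told to retain extra WAL for lagging standbys; in 9.1 that is wal_keep_segments in postgresql.conf (the value below is only an illustration and should be sized to your write volume):

```
# postgresql.conf on the master -- keep at least this many 16 MB WAL
# segments in pg_xlog so a standby that falls behind can still catch up.
wal_keep_segments = 512
```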
Best Answer
pg_start_backup will perform a checkpoint, as dezso notes. This does have an impact, but your database performs checkpoints quite regularly anyway, and must do so to function, so they're clearly not a problem for you. An early checkpoint means that less data has been accumulated, meaning that if anything a checkpoint from pg_start_backup will be lower-impact than normal.

Where you need to worry is the rsync or equivalent pg_basebackup step. The read I/O from this won't be too bad since it's sequential, but it'll still probably significantly hurt your database's I/O performance, and it'll also tend to push hot data out of RAM cache in favour of less-used data, causing cache thrashing as the more-needed data is then read back in.

You can use nice and ionice to help limit the I/O impact (but not the cache impact); however, there's a cost to that. The backup will take longer, and until you complete the backup and run pg_stop_backup your system is - as I understand it - accumulating WAL it cannot delete, accumulating checkpoint debt for a BIG checkpoint at the end of the backup run, and accumulating table and index bloat because it can't clean up dead rows. So you really can't afford to have the backup take forever, especially if you have very high-churn tables.

In the end, it's hard to say whether you can safely use pg_start_backup and pg_stop_backup for hot backups in your environment. Most people can, but if you're close to the edge of what your hardware can do, have tight timing requirements, cannot afford the risk of a stall, and have very high-churn tables as well as very big tables, it might be troublesome. Unfortunately, you pretty much need to test it and see.
If you can, it might be worth issuing a CHECKPOINT then taking an atomic snapshot of the volume your database is on instead, using LVM, your SAN's tools, EBS, or whatever you're on. If you can do this, you can then copy the snapshot at your leisure. This approach isn't suitable for taking a base backup for PITR/warm standby/hot standby, but it's perfectly good for a static backup copy, and is much lower impact on the system. You can only do this if your snapshots are atomic and your entire database including WAL is on a single volume, though.

One possibility I haven't yet investigated is combining the two approaches. It occurs to me that one could possibly (untested and possibly wrong and unsafe, I don't know yet):
pg_start_backup
take an atomic snapshot of each volume
pg_stop_backup
Essentially, the idea is to reduce how long the DB has to delay its checkpoints by taking a point-in-time snapshot of each volume, which you can then copy at your leisure.
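As a sketch only (untested, per the caveats above; the hostnames, volume group, logical volume, and mount points are placeholders, and this assumes the whole cluster including WAL lives on /dev/vg0/pgdata):

```shell
# 1. Tell PostgreSQL a backup is starting.
psql -U postgres -c "SELECT pg_start_backup('snap');"

# 2. Take an atomic LVM snapshot of the volume holding the data directory.
lvcreate --size 10G --snapshot --name pgdata_snap /dev/vg0/pgdata

# 3. End the backup immediately so WAL cleanup and checkpoints resume.
psql -U postgres -c "SELECT pg_stop_backup();"

# 4. Mount the snapshot read-only and copy it off at your leisure.
mount -o ro /dev/vg0/pgdata_snap /mnt/pgdata_snap
rsync -a /mnt/pgdata_snap/ backuphost:/backups/pgdata/
```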