I seem to have an intermittent problem with my postgres backup/restore to standby
Postgres, is version 9.2 running on red hat 6 enterprise
i have a production server (PROD1) and a hot standby (STANDBY1), on sunday evening i have a script which puts
PROD1 into backup mode, performs a tar on the $PGDATA directory, copies the file to STANDBY1, a script on STANDBY1
is then called to stop postgres on STANDBY1, unpack the tar, and start postgres again.
In the meantime the archive_command on PROD1 is continuing to copy WAL files over the STANDBY1, so in theory STANDBY1
will be brought up to date. Then for the remainder of the week the WAL files are applied until the next Sunday when the
process starts again.
code extract for backup script on PROD1
psql -X -U postgres --set ON_ERROR_STOP=on <<PSQL
select pg_start_backup('${WEEK_START}_DataBackup','true');
PSQL
tar -zcf $WEEKLYBACKUP_DIRECTORY/full_backup_${WEEK_START}.tar.gz --exclude=$PGDATA/pg_xlog/* --exclude=$PGDATA/pg_xlogArchive/*
--exclude=$PGDATA/pg_xlogRestored --exclude=$PGADTA/postmaster.pid --exclude=$PGDATA/postmaster.opts $PGDATA/
psql -X -U postgres --set ON_ERROR_STOP=on <<PSQL
select pg_stop_backup();
PSQL
scp -p $WEEKLYBACKUP_DIRECTORY/full_backup_${WEEK_START}.tar.gz $DR_SERVER:$WEEKLYBACKUP_DIRECTORY/.
ssh -T $DR_SERVER <<EOF >> $MAILFILE
cd /var/lib/pgsql/Scripts
./postgres-hotStandby $WEEK_START Y
EOF
code extract for hotstandby script on STANDBY1
pg_ctl stop -D $PGDATA -m i >> $LOGFILE
tar -zxf $WEEKLYBACKUP_DIRECTORY/full_backup_${RESTORE_WEEK}.tar.gz
# create a recovery file
echo "standby_mode = 'on'" >$PGDATA/recovery.conf
echo "primary_conninfo = 'host=${LIVE_SERVER}'" >>$PGDATA/recovery.conf
echo "restore_command = 'cp -f $PGDATA/pg_xlog/%f $PGDATA/pg_xlogRestored'" >>$PGDATA/recovery.conf
echo "trigger_file = '${PGDATA}/failover'" >>$PGDATA/recovery.conf
sed -i 's/#hot_standby = off/hot_standby = on/g' $PGDATA/postgresql.conf
pg_ctl -w -D $PGDATA start
the above process works every second time it runs and fails the other times, giving the message
"psql: FATAL: the database system is starting up"
when i try and
get into psql.
looking in the log file on STANDBY1, postgres is awaiting a WAL file which it says…
cp: cannot stat
`/vol2/postgres92/pgsql/data/pg_xlog/000000010000005200000003': No
such file or directory 2015-04-24 00:10:40 EDT -LOG: streaming
replication successfully connected to primary 2015-04-24 00:10:40
EDT -FATAL: could not receive data from WAL stream: FATAL: requested
WAL segment 000000010000005200000003 has already been removed
when pg_start_backup
is run, this creates a WAL file which is shipped to STANDBY1 (eg. 000000010000005200000003), and when pg_stop_backup
is run this
creates a end of backup WAL (eg. 000000010000005200000003.00000020.backup), which is also correctly shipped to STANDBY1. But when the failure occurs
the pg_start_backup WAL seems to go missing on the STANDBY1 server.
hopefully this all makes some sort of sense to someone, any ideas what is going wrong? I seem unable to reproduce this on my test systems, and i'm thinking it
may be a timing issue as due to the size and location of PROD1 and STANDBY1, the scp between the servers takes 15mins.
thanks for any help you can give
Best Answer
This part generating
recovery.conf
is suspicious:It seems that the source and destination of this
restore_command
are reversed.Compare yours to the example from the doc in Continuous archiving:
The restore must copy from the archive directory (which typically is outside
$PGDATA
and entirely under the DBA's control) into$PGDATA/pg_xlog
(as expanded by%p
).I'm assuming that
$PGDATA/pg_xlogRestored
is your archive directory since I don't see what else it could be. Personally I'd place it outside of$PGDATA
by principle but that's not mandatory.