PostgreSQL Backup – Missing WAL File on Standby Server Following a Backup

postgresqlpostgresql-9.2

I seem to have an intermittent problem with my postgres backup/restore to standby

Postgres, is version 9.2 running on red hat 6 enterprise

i have a production server (PROD1) and a hot standby (STANDBY1), on sunday evening i have a script which puts
PROD1 into backup mode, performs a tar on the $PGDATA directory, copies the file to STANDBY1, a script on STANDBY1
is then called to stop postgres on STANDBY1, unpack the tar, and start postgres again.

In the meantime the archive_command on PROD1 is continuing to copy WAL files over the STANDBY1, so in theory STANDBY1
will be brought up to date. Then for the remainder of the week the WAL files are applied until the next Sunday when the
process starts again.

code extract for backup script on PROD1

psql -X -U postgres --set ON_ERROR_STOP=on <<PSQL
select pg_start_backup('${WEEK_START}_DataBackup','true');
PSQL

tar -zcf $WEEKLYBACKUP_DIRECTORY/full_backup_${WEEK_START}.tar.gz --exclude=$PGDATA/pg_xlog/* --exclude=$PGDATA/pg_xlogArchive/* 
--exclude=$PGDATA/pg_xlogRestored --exclude=$PGADTA/postmaster.pid --exclude=$PGDATA/postmaster.opts $PGDATA/

psql -X -U postgres --set ON_ERROR_STOP=on <<PSQL
select pg_stop_backup();
PSQL

scp -p $WEEKLYBACKUP_DIRECTORY/full_backup_${WEEK_START}.tar.gz $DR_SERVER:$WEEKLYBACKUP_DIRECTORY/.

ssh -T $DR_SERVER <<EOF >> $MAILFILE
  cd /var/lib/pgsql/Scripts
  ./postgres-hotStandby $WEEK_START Y
EOF

code extract for hotstandby script on STANDBY1

pg_ctl stop -D $PGDATA -m i >> $LOGFILE
tar -zxf $WEEKLYBACKUP_DIRECTORY/full_backup_${RESTORE_WEEK}.tar.gz

# create a recovery file
echo "standby_mode = 'on'" >$PGDATA/recovery.conf
echo "primary_conninfo = 'host=${LIVE_SERVER}'" >>$PGDATA/recovery.conf
echo "restore_command = 'cp -f $PGDATA/pg_xlog/%f $PGDATA/pg_xlogRestored'" >>$PGDATA/recovery.conf
echo "trigger_file = '${PGDATA}/failover'" >>$PGDATA/recovery.conf
sed -i 's/#hot_standby = off/hot_standby = on/g' $PGDATA/postgresql.conf
pg_ctl -w -D $PGDATA start

the above process works every second time it runs and fails the other times, giving the message

"psql: FATAL: the database system is starting up"

when i try and
get into psql.

looking in the log file on STANDBY1, postgres is awaiting a WAL file which it says…

cp: cannot stat
`/vol2/postgres92/pgsql/data/pg_xlog/000000010000005200000003': No
such file or directory 2015-04-24 00:10:40 EDT -LOG: streaming
replication successfully connected to primary 2015-04-24 00:10:40
EDT -FATAL: could not receive data from WAL stream: FATAL: requested
WAL segment 000000010000005200000003 has already been removed

when pg_start_backup is run, this creates a WAL file which is shipped to STANDBY1 (eg. 000000010000005200000003), and when pg_stop_backup is run this
creates a end of backup WAL (eg. 000000010000005200000003.00000020.backup), which is also correctly shipped to STANDBY1. But when the failure occurs
the pg_start_backup WAL seems to go missing on the STANDBY1 server.

hopefully this all makes some sort of sense to someone, any ideas what is going wrong? I seem unable to reproduce this on my test systems, and i'm thinking it
may be a timing issue as due to the size and location of PROD1 and STANDBY1, the scp between the servers takes 15mins.

thanks for any help you can give

Best Answer

This part generating recovery.conf is suspicious:

echo "restore_command = 'cp -f $PGDATA/pg_xlog/%f $PGDATA/pg_xlogRestored'" >>$PGDATA/recovery.conf

It seems that the source and destination of this restore_command are reversed.

Compare yours to the example from the doc in Continuous archiving:

restore_command = 'cp /mnt/server/archivedir/%f %p'

The restore must copy from the archive directory (which typically is outside $PGDATA and entirely under the DBA's control) into $PGDATA/pg_xlog (as expanded by %p).

I'm assuming that $PGDATA/pg_xlogRestored is your archive directory since I don't see what else it could be. Personally I'd place it outside of $PGDATA by principle but that's not mandatory.