PostgreSQL – How to Determine if Hot Standby is Fully Mirrored

postgresqlpostgresql-9.2replication

I have set up a hot standby of a PostgreSQL server. It all seems to be working, but I just want to be sure that I'm not missing something. In /var/lib/pgsql/9.2/data/pg_log/postgresql-Wed.log I have the following:

LOG:  creating missing WAL directory "pg_xlog/archive_status"
cp: cannot stat `/var/lib/pgsql/9.2/wal/00000002.history': No such file or directory
LOG:  entering standby mode
cp: cannot stat `/var/lib/pgsql/9.2/wal/0000000200000031000000B4': No such file or directory
LOG:  streaming replication successfully connected to primary
LOG:  redo starts at 31/B47BFAC0
LOG:  consistent recovery state reached at 31/B73624A0
LOG:  database system is ready to accept read only connections

I am concerned about the missing WAL files. Can anyone confirm that, as long as it reaches a consistent state, the hot standby contains all the data of the master?

Everything else I check indicates that it's ok; for example, running psql -x -c "select * from pg_stat_replication;" on the master looks good, and adding a new record on the master replicates. I just want to be sure that there won't be anything missing from the slave.

Best Answer

I think this is normal and expected if your restore_command is set to something like this example:

restore_command = 'cp /mnt/server/archivedir/%f "%p"'

The manual says that:

At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.

So you can expect to see exactly one restore_command failure when you start your standby, because PostgreSQL will keep calling it (with incrementing log file names/numbers) until it fails once.

Then it will connect to the primary and start streaming as described above, and as you saw in your logs:

LOG:  streaming replication successfully connected to primary

The slave is not guaranteed to be exactly up-to-date with the master, because it could be disconnected from the master for example. In particular, this line:

LOG:  consistent recovery state reached at 31/B73624A0

does not mean that "the hot standby contains all the data of the master". However, if you see it followed by this line, as you did:

LOG:  database system is ready to accept read only connections

then the database is "ready enough" to start functioning as a read-only standby, as the manual says:

It may take some time for Hot Standby connections to be allowed, because the server will not accept connections until it has completed sufficient recovery to provide a consistent state against which queries can run. During this period, clients that attempt to connect will be refused with an error message.

In my case, I saw consistent recovery state reached not followed by database system is ready to accept read only connections. This turned out to be a problem with an embedded scripting language plugin (plpython2) having a system-wide startup script (sitecustomize.py) which did bad things to the PostgreSQL process (enabling faulthandler and installing a signal handler for SIGUSR2) which caused it to never enter hot standby mode.