PostgreSQL Replication delay debugging

postgresql-9.1replication

I've got a cluster of 9.1 servers running streaming replication and using WAL-E based log shipping for when that falls behind.

Generally speaking it's a pretty solid setup and does not fall behind, however we appear to have one slave in the cluster that consistently falls behind when replicating in the wee hours of the morning. It always catches up again shortly after but I have not yet been able understand why it does this.

There does not appear to be consistency with regards to the timing apart from it happens when our application is not handling a lot of load. Is it possible that I'm seeing this occur because of lack of data to be replicated?

Here's what the relevant portions of the postgresql.conf and recovery.conf look like

postgresql.conf

archive_mode = on
archive_command = '. /mnt/data/postgresql/.profile && wal-e wal-push %p'

recovery.conf

recovery_target_timeline = 'latest'
standby_mode = 'on'
primary_conninfo = 'host=master_db_ip_address port=5432 user=postgres'
restore_command = '. /mnt/data/postgresql/.profile && wal-e wal-fetch "%f" "%p"'

To add a little more context, you can see the replication delay for this server is creeping up as time goes on.

PostgreSQL replication delay

  1. This should not occur. The streaming replication delay should be < 1-2s most of the time (based on past experience with this cluster)
  2. There are no scheduled jobs running on the db cluster at the time these spikes in the replication timing occur.

What are some good strategies for debugging this kind of issue?

EDIT

I had it suggested to me that my methodology for computing the replication delay may be incorrect, here's what I'm using:

SELECT EXTRACT(EPOCH FROM NOW() - pg_last_xact_replay_timestamp())

Best Answer

Maybe you are experienceing what is described at the end of the following article:

http://www.niwi.be/2013/02/16/replication-status-in-postgresql/

In a very busy database, with many writes per second, this number will remain fairly accurate. However, in a system where there are few writes, the "replication_delay" will continually grow because the last replayed transaction timestamp isn't increasing (this is generally the same limitation as MySQL's SHOW SLAVE STATUS output).