Post Mortem Analysis – PostgreSQL Replication Failure

data synchronizationmaster-slave-replicationpostgresqlpostgresql-9.4replication

We have a PostgreSQL 9.4.9 production server which was replicating to a slave instance, but today I discovered that the instance is out of sync!

The obvious actions would be to recreate the slave, setup metrics and proper alarms for the replication activity, so we can effectively monitor the sync status between master and slave nodes.

But, as the sync failed, I'd like to first diagnose the problem and try to identify the root cause of it, as this would be the second time this happens in about 6 months.

Question: How to diagnose what failed in the replication process so it can be done in a better way this time?

Version specifics:

PostgreSQL 9.4.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit

From slave node, at /var/log/postgresql/postgresql-9.4-main.log I can see:

2017-07-18 19:43:55 UTC [12816-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:43:55 UTC [12816-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:00 UTC [12817-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:00 UTC [12817-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:05 UTC [12821-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:05 UTC [12821-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:10 UTC [12825-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:10 UTC [12825-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:15 UTC [12826-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:15 UTC [12826-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

New Question: How can I see backward on where the actual problem appeared?

Master postgresql.conf: https://pastebin.com/NJX5ku6m

Slave postgresql.conf: https://pastebin.com/CUZcyazC

Slave recovery.conf:

standby_mode = on
primary_conninfo = 'host=10.1.1.65 port=5432 user=replicador password=replicador'

Best Answer

Based on this I'd say you didn't have enough wal_keep_segments on the master, weren't using a replication slot, and either had hot_standby_feedback off or had the connection drop for long enough for the master to remove needed WAL.

And you are presumably not using WAL archiving (archive_command on master, restore_command on replica) as fallback.

So the master removed transaction logs the standby needed.

You'll need to re-create the standby. Then either:

  • Set the standby to use a replication slot and enable hot_standby_feedback; or

  • enable archive_command and restore_command