We have a PostgreSQL 9.4.9 production server which was replicating to a slave instance, but today I discovered that the instance is out of sync!
The obvious actions would be to recreate the slave, setup metrics and proper alarms for the replication activity, so we can effectively monitor the sync status between master and slave nodes.
But, as the sync failed, I'd like to first diagnose the problem and try to identify the root cause of it, as this would be the second time this happens in about 6 months.
Question: How to diagnose what failed in the replication process so it can be done in a better way this time?
Version specifics:
PostgreSQL 9.4.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
From slave node, at /var/log/postgresql/postgresql-9.4-main.log
I can see:
2017-07-18 19:43:55 UTC [12816-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:43:55 UTC [12816-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
2017-07-18 19:44:00 UTC [12817-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:00 UTC [12817-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
2017-07-18 19:44:05 UTC [12821-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:05 UTC [12821-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
2017-07-18 19:44:10 UTC [12825-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:10 UTC [12825-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
2017-07-18 19:44:15 UTC [12826-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:15 UTC [12826-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
New Question: How can I see backward on where the actual problem appeared?
Master postgresql.conf
: https://pastebin.com/NJX5ku6m
Slave postgresql.conf
: https://pastebin.com/CUZcyazC
Slave recovery.conf
:
standby_mode = on
primary_conninfo = 'host=10.1.1.65 port=5432 user=replicador password=replicador'
Best Answer
Based on this I'd say you didn't have enough
wal_keep_segments
on the master, weren't using a replication slot, and either hadhot_standby_feedback
off or had the connection drop for long enough for the master to remove needed WAL.And you are presumably not using WAL archiving (
archive_command
on master,restore_command
on replica) as fallback.So the master removed transaction logs the standby needed.
You'll need to re-create the standby. Then either:
Set the standby to use a replication slot and enable
hot_standby_feedback
; orenable
archive_command
andrestore_command