PostgreSQL 9.6 log shipping replication stuck on startup

postgresql, replication, write-ahead-logging

Overview

We have been trying for a while to restore replication between a Master and a Replication server, but the person who originally set it up has left. In brief, the issue is that the Replication server starts up, within a few minutes reports postgres: startup process recovering [FILENAME], and never advances beyond that same FILENAME.

Configuration

Master

pg_hba.conf

local   replication             replicator                      trust
host    replication             replicator    10.1.20.181/32    trust

postgresql.conf

(only active settings related to replication)

wal_level = hot_standby         # minimal, archive, hot_standby, or logical
fsync = on                      # turns forced synchronization on or off
wal_sync_method = fdatasync     # the default is the first option
archive_mode = on               # allows archiving to be done
archive_command = '/bin/true'   # Temporary until replication is working
archive_timeout = 300           # force a logfile segment switch after this
max_wal_senders = 5             # max number of walsender processes
wal_keep_segments = 600         # in logfile segments, 16MB each; 0 disables
max_replication_slots = 3       # max number of replication slots
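
As a sanity check, the settings the running Master is actually using can be confirmed with something like the query below (a rough sketch; it assumes psql can be run locally as the postgres superuser, and note that wal_level, archive_mode and max_wal_senders only take effect after a restart, not a reload):

# Run on the Master: list the replication-related settings currently in effect.
psql -U postgres -c "SELECT name, setting FROM pg_settings
                     WHERE name IN ('wal_level', 'archive_mode', 'archive_command',
                                    'max_wal_senders', 'wal_keep_segments',
                                    'max_replication_slots');"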

Replication Server

postgresql.conf

hot_standby = on 

recovery.conf

standby_mode = 'on' # enables stand-by (readonly) mode

# Connect to the master postgres server using the replicator user we created.
primary_conninfo = 'host=10.1.20.180 port=5432 user=replicator'
primary_slot_name = 'oh_slot'

# Specifies a trigger file whose presence should cause streaming replication to
# end (i.e., failover).
trigger_file = '/tmp/pg_failover_trigger'  
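
Before starting the standby, it can also be worth confirming that a replication connection to the Master actually works with the user named in primary_conninfo. A rough check, not part of our original procedure: replication=1 asks libpq for a physical replication connection, which is what the replication lines in the Master's pg_hba.conf match.

# Run on the Replication server (10.1.20.181); should print one row of system info.
psql "host=10.1.20.180 port=5432 user=replicator replication=1" -c "IDENTIFY_SYSTEM;"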

Replication Set up Process

On the Master, we clear out old WAL files by using pg_controldata to find the current WAL file and deleting anything older.

On the Replication Server (a shell sketch of these steps follows the list):

  1. Stop the database
  2. Unmount pg_xlog directory
  3. Delete contents of data directory
  4. Run pg_basebackup -Xs -U replicator -h #{SOURCE_DB} -D /var/lib/postgresql/9.6/main -v
  5. Copy recovery.conf into new data directory
  6. Move the pg_xlog directory, create a new one and mount it, then copy the archive files over to the new directory and fix permissions
  7. Start up the database
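
Roughly, the rebuild on the Replication server looks like the shell sketch below, assuming the commands are run as root. Only the pg_basebackup invocation (with the Master's address filled in from primary_conninfo) comes from the list above; the service name, the recovery.conf source path, and the pg_xlog mount device are placeholders for illustration.

# 1. Stop the standby cluster (Debian/Ubuntu-style service name assumed).
pg_ctlcluster 9.6 main stop

# 2. Unmount the separate pg_xlog volume.
umount /var/lib/postgresql/9.6/main/pg_xlog

# 3. Delete the contents of the old data directory.
rm -rf /var/lib/postgresql/9.6/main/*

# 4. Take a fresh base backup from the Master, streaming WAL into pg_xlog (-Xs).
sudo -u postgres pg_basebackup -Xs -U replicator -h 10.1.20.180 -D /var/lib/postgresql/9.6/main -v

# 5. Copy recovery.conf into the new data directory (source path is a placeholder).
sudo -u postgres cp /path/to/recovery.conf /var/lib/postgresql/9.6/main/recovery.conf

# 6. Move the streamed pg_xlog aside, recreate the mount point, mount the WAL
#    volume, copy the WAL files onto it, and fix ownership and permissions.
mv /var/lib/postgresql/9.6/main/pg_xlog /var/lib/postgresql/9.6/main/pg_xlog.tmp
mkdir /var/lib/postgresql/9.6/main/pg_xlog
mount /dev/mapper/xlog_vol /var/lib/postgresql/9.6/main/pg_xlog   # device is a placeholder
cp /var/lib/postgresql/9.6/main/pg_xlog.tmp/* /var/lib/postgresql/9.6/main/pg_xlog/
chown -R postgres:postgres /var/lib/postgresql/9.6/main/pg_xlog
chmod 700 /var/lib/postgresql/9.6/main/pg_xlog

# 7. Start the standby again.
pg_ctlcluster 9.6 main start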

At this point, I've seen the startup process replay a number of WAL files, but it always stops rather quickly (within a few minutes) and from then on reports that it is recovering the same file, indefinitely. Comparing tables between the two databases confirms that the replication database hasn't picked up new records that are present in the master database.
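
A rough way to see whether recovery is advancing at all (assuming superuser psql access; these are the 9.6 function and column names, which were renamed in later releases):

# On the Replication server: current receive/replay positions.
# Re-run after a minute or two; if neither value moves, recovery is stuck.
psql -U postgres -c "SELECT pg_is_in_recovery(),
                            pg_last_xlog_receive_location(),
                            pg_last_xlog_replay_location();"

# On the Master: is a walsender connected for this standby at all?
psql -U postgres -c "SELECT client_addr, state, sent_location, replay_location
                     FROM pg_stat_replication;"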

I get the feeling that I am missing something fundamental and would greatly appreciate questions and guidance.

Updates

Luan Huynh: I had the false impression that there were no errors on startup, but the log contains the following: 2017-04-16 06:38:32 EDT [13029-1] FATAL: could not start WAL streaming: ERROR: replication slot "oh_slot" does not exist. So clearly I did not create the replication slot on the Master as I thought I had.

KookieMonster: I had found advice to set archive_command to /bin/true as a debugging technique. $FILENAME is the last of the files in the pg_xlog directory as it was copied during the pg_basebackup. The Master does have additional WAL files created after $FILENAME, but the error in the log seems to clearly indicate why the system has not been moving over the WAL files created after the pg_basebackup.

I'll verify the replication slot on the master and update.

Best Answer

Thanks to Luan Huynh, I checked the logs on the Slave and saw that it was not able to start WAL streaming due to the missing replication slot on the Master (I really was certain it had been created! :-) )

Creating the replication slot with pg_create_physical_replication_slot was all that was needed. Verifying the replication slot with SELECT * FROM pg_replication_slots showed the slot had been created and WAL streaming was active.
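
Concretely, something along these lines (run on the Master as a superuser; the slot name is the one from primary_slot_name in recovery.conf):

# Create the physical replication slot the standby's recovery.conf refers to.
psql -U postgres -c "SELECT pg_create_physical_replication_slot('oh_slot');"

# Verify: once the standby reconnects, active should show t and restart_lsn should advance.
psql -U postgres -c "SELECT slot_name, slot_type, active, restart_lsn
                     FROM pg_replication_slots;"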

I was able to verify that the latest records on highly active tables on the Master are being successfully replicated on the Slave.