PostgreSQL – Cannot get the slave (replication) server to start

postgresql, replication

My replication server will not start. I have followed the instructions here:
http://opensourcedbms.com/dbms/setup-replication-with-postgres-9-2-on-centos-6redhat-el6fedora/

As well as several other places including the Postgres Wiki as they all have the same information.

Here is what happens:
I do a full backup of the /9.2/data folder, move it to the replication (slave) server, and untar it. I can start PostgreSQL as well as pgAdmin and access all data with no problems.

I then move on to the instructions on editing pg_hba.conf and postgresql.conf for the slave server. I attempt to start it, and it fails (error in red: [FAILED]). I cannot find any logs anywhere to give me a hint as to why.

I even verified there was no postmaster.pid in the data folder.

Also, I cannot find any log files. Do I need to "activate" a log file in the configuration?

So, if anyone wants to take a stab in the dark on my vague description, I'd love to hear any suggestions. I can put my conf files on pastebin if that will help.

Best Answer

Steps to set up the hot standby

First, determine the mount points of the PostgreSQL data directories and the WAL (Write-Ahead Log) directories. For performance, the $PGDATA and pg_xlog directories should be on separate volumes.

In the examples below, they are defined as follows:

  • $PGDATA is the "master" cluster, located at /pgdata/9.3/data on (for example) 192.168.10.0
  • The "slave" (aka "hot standby") will be located at /pgdata/9.3/data on (for example) 192.168.10.1
  • The WAL staging area on the slave will be /pgdata/WAL_Archive. This is where the WAL segments are sent by the archive_command on the master; ideally it is on a separate volume from $PGDATA.

Assumptions:

  1. The master cluster is up and running normally.
  2. There is no slave cluster yet.
  3. You are doing all operations as the "postgres" OS user (not root!)
  4. ssh between both servers is working (see the quick check after this list)
  5. You are using at least PostgreSQL 9.2
  6. pg_hba.conf and postgresql.conf are located at $PGDATA/. If not, change the instructions below to match your location.
  7. You are setting up the slave on a separate server.
  8. If the servers are not on the same network (eg. different colocations), use the "-z" flag in the "archive_command" rsync on the master to compress before transferring.
  9. These commands are for Linux, but should work on Windows too with some command substitutions.
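
Before continuing, it is worth verifying assumption 4 now, since the archive_command configured below relies on passwordless ssh from the master to the slave. A quick check, assuming the example IPs above and the "postgres" OS user:

## On the master, as postgres; should print "ok" with no password prompt
ssh postgres@192.168.10.1 'echo ok'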

On the slave, create the staging directory to hold the master's log files

mkdir -p /pgdata/WAL_Archive
chown postgres:postgres /pgdata/WAL_Archive

On the master, edit the $PGDATA/postgresql.conf

wal_level = hot_standby
archive_mode = on
                    ## /pgdata/WAL_Archive is a staging directory on the slave
archive_command = 'rsync -W -az %p postgres@$SLAVE_IP_HERE:/pgdata/WAL_Archive/'
max_wal_senders = 5
wal_keep_segments = 5000   # If you have the room; helps pg_basebackup
                           # not fail due to WAL segments being removed from the master.
                           # For clusters with very little traffic, 100 is probably fine.
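
A manual dry run of the archive_command is a cheap sanity check before restarting. This is just a sketch: it assumes the staging directory from the previous step, and uses a throwaway scratch file (/tmp/wal_test, a name invented here) in place of the %p placeholder:

## On the master, as postgres: mimic the archive_command with a scratch file
touch /tmp/wal_test
rsync -W -az /tmp/wal_test postgres@$SLAVE_IP_HERE:/pgdata/WAL_Archive/
## Then confirm it arrived on the slave, and remove it:
## ls -l /pgdata/WAL_Archive/wal_test && rm /pgdata/WAL_Archive/wal_test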

On the master, create the replication role, which will be copied to the slave via pg_basebackup. Set a password for the "replication" role

psql -U postgres -d postgres -c "CREATE USER replication WITH replication ENCRYPTED PASSWORD 'changeme' LOGIN"
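
To confirm the role was created with the replication and login attributes (a quick check, not required by the setup):

psql -U postgres -d postgres -c "SELECT rolname, rolreplication, rolcanlogin FROM pg_roles WHERE rolname = 'replication';"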

Modify the master $PGDATA/pg_hba.conf and enable the replication user for the IP of the slave

# TYPE DATABASE USER ADDRESS METHOD
#hostssl    replication     replication     $SLAVE_IP_HERE/32       md5
host    replication     replication     $SLAVE_IP_HERE/32       md5

Restart the master cluster to pick up the changes to the postgresql.conf. This is done as the cluster superuser.

For example:

pg_ctl -D $PGDATA restart -m fast
## The master cluster MUST be restarted before the pg_basebackup command is executed.
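
After the restart, a quick sanity check on the master confirms the new settings took effect; it catches typos before pg_basebackup does:

psql -U postgres -d postgres -c "SELECT name, setting FROM pg_settings WHERE name IN ('wal_level', 'archive_mode', 'max_wal_senders');"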

On the slave, from $HOME, issue the pg_basebackup command to start setting up the hot standby from the master.

## --host=IP_OF_MASTER  -> The master's IP
## --pgdata=$PGDATA     -> The slave's $PGDATA directory
## --xlog-method=stream -> Opens a second connection to the master to stream the WAL segments rather than pulling them all at the end
## --password will prompt for the replication role's password
## Without compression, "stream" gets the changes via the same method as Streaming Replication

time pg_basebackup --pgdata=$PGDATA --host=IP_OF_MASTER --port=5432 --username=replication --password --xlog-method=stream --format=plain --progress --verbose

## Alternate version with compression, note "--xlog --gzip --format=tar"
#time pg_basebackup --pgdata=$PGDATA --host=IP_OF_MASTER --port=5432 --username=replication --password --xlog --gzip --format=tar --progress --verbose

On the slave, after pg_basebackup has successfully completed, edit $PGDATA/postgresql.conf

hot_standby = on #off               # "on" allows queries during recovery
max_standby_archive_delay = 15min   # max delay before canceling queries,
                                    # set to hours if backups will be taken from here
max_standby_streaming_delay = 15min # max delay before canceling queries
hot_standby_feedback = on #off

On the slave, create $PGDATA/recovery.conf:

standby_mode = on

## To promote the slave to a live database, issue "touch /tmp/promote_db"
## Warning: If multiple slaves share the same /tmp/ directory,
##          then the trigger file must be named uniquely, else multiple slaves
##          could attempt to be promoted in the presence of the trigger file.
trigger_file = '/tmp/promote_db_slave'

## Host can be the master's IP or hostname
primary_conninfo = 'host=IP_OF_MASTER port=5432 user=replication password=CHANGEME'

## Log the standby WAL segments applied to a standby.log file
## TODO: Add the standby.log to a log rotator
## The paths must be explicitly defined, including the path to pg_archivecleanup
restore_command = 'cp /pgdata/WAL_Archive/%f "%p" 2>>/pgdata/9.3/data/pg_log/standby.log'

## XXX: If multiple slaves share the staging WAL directory,
##      do not use pg_archivecleanup as WAL segments could be removed
##      before being applied to other slaves.
archive_cleanup_command = '/usr/pgsql-9.3/bin/pg_archivecleanup /pgdata/WAL_Archive %r'

## On hot standby clusters, set to 'latest' to switch to the newest timeline in the archive
recovery_target_timeline = 'latest'
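
Before starting the slave, it is worth confirming that the paths hard-coded in recovery.conf actually exist. A minimal check, assuming the example layout above:

## On the slave, as postgres
ls /usr/pgsql-9.3/bin/pg_archivecleanup   ## the binary named in archive_cleanup_command
ls -ld /pgdata/WAL_Archive                ## the staging directory used by restore_command
mkdir -p /pgdata/9.3/data/pg_log          ## restore_command appends standby.log here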

The slave should be ready to be started now. Start the cluster on the slave with whatever method works best.

pg_ctl -D $PGDATA start
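
Once the slave is up, a couple of queries confirm replication is actually flowing (column names are per the 9.2/9.3 pg_stat_replication view):

## On the slave: should return "t" while in standby mode
psql -U postgres -d postgres -c "SELECT pg_is_in_recovery();"

## On the master: one row per connected standby; "state" should be "streaming"
psql -U postgres -d postgres -c "SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication;"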

The following are the commands I run when recreating a standby over and over for testing. I add them to a script called /root/recreate_standby.sh. It runs as root, but that is not necessary; if you run it as postgres, remove the "su - postgres -c" wrappers.

#!/bin/bash

## This script runs on the standby.
## Executed as root; otherwise remove the "su - postgres -c" wrappers.
## Assumes you have a valid recovery.conf saved at
## $PGDATA/../recovery.conf.bkp

export PGDATA=/path/to/data/dir     ## Must be set correctly
export PGPORT=5432
export MASTER=192.168.x.x            ## IP or host entry for the master PostgreSQL server
export PGBIN=/usr/pgsql-9.3/bin

service postgresql-9.3 stop -m immediate

if [ $? != 0 ]; then
    service postgresql-9.3 start
    echo "Could not shut down PostgreSQL. Aborting."
    exit 1
fi

rm -rf "$PGDATA"

if [ $? != 0 ]; then
    echo "Could not remove the PostgreSQL $PGDATA dir. Aborting."
    exit 1
fi

## If the replication role is not set to "trust" in the master's
## pg_hba.conf file, the password will need to be passed into the command below,
## and "--no-password" will need to be removed or revised to be "--password"
su - postgres -c "$PGBIN/pg_basebackup --pgdata=$PGDATA --host=$MASTER --port=$PGPORT --username=replication --no-password --xlog-method=stream --format=plain --progress --verbose"

su - postgres -c "cp -p $PGDATA/../recovery.conf.bkp $PGDATA/recovery.conf"

service postgresql-9.3 start

su - postgres -c "$PGBIN/pg_isready -U postgres -p $PGPORT -t2"

while [ $? != 0 ]; do
  echo "Sleep 1 second, check if slave is up yet. If not, sleep again."
  sleep 1;
  su - postgres -c "$PGBIN/pg_isready -U postgres -p $PGPORT -t2"
done

su - postgres -c "$PGBIN/psql -d postgres -U postgres -qXc 'select pg_is_in_recovery() as is_pg_in_recovery'"

exit 0

For more in-depth details, see the current PostgreSQL docs.