PostgreSQL – Promoting Warm Standby to Master Using Streaming Replication

backuphigh-availabilitypostgresqlreplication

I've currently got 2 database servers running Centos 6 and Postgresql 9.4 (identical clones on the same network). They are set up as a primary and a backup created using pg_basebackup from the primary, with the primary copying WAL to the secondary's file system and the backup set to update from the WAL and connect to the primary using streaming replication.

This all works fine up to the point that I kill the master and try to promote the backup to be the new master. I'm running kill -9 to simulate a catastrophic failure of the master and then touching the trigger file on the backup shortly afterwards.

The problem is that as soon as I touch the trigger file on the backup server it detects it and then fails to start up as the primary.

Here are the logs from the backup at the point I disconnect it from the primary:

< 2015-10-05 14:20:52.888 BST >LOG:  trigger file found: /tmp/psql.trigger
< 2015-10-05 14:20:52.888 BST >LOG:  redo is not required
< 2015-10-05 14:20:52.904 BST >LOG:  restored log file "000000010000000000000004" from archive
< 2015-10-05 14:20:52.906 BST >PANIC:  record with zero length at 0/40000C8
< 2015-10-05 14:20:52.906 BST >LOG:  startup process (PID 3101) was terminated by signal 6: Aborted
< 2015-10-05 14:20:52.906 BST >LOG:  terminating any other active server processes
< 2015-10-05 14:21:08.580 BST >LOG:  database system was interrupted while in recovery at log time 2015-10-05 13:50:08 BST
< 2015-10-05 14:21:08.580 BST >HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
< 2015-10-05 14:21:08.935 BST >LOG:  entering standby mode
< 2015-10-05 14:21:08.947 BST >LOG:  restored log file "000000010000000000000004" from archive
< 2015-10-05 14:21:08.949 BST >LOG:  record with zero length at 0/40000C8
< 2015-10-05 14:21:08.951 BST >FATAL:  could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host "192.168.56.101" and accepting
        TCP/IP connections on port 5432?

The logs at 13:50 were just the server starting up in recovery mode.

The config in postgresql.conf on the primary is:

listen_addresses = '*'
wal_level = archive
max_wal_senders = 2
archive_mode = on
archive_command = 'rsync -a %p postgres@192.168.56.102:/apps/pgsql/9.4/write_ahead_logs/%f'
archive_timeout = 600

The config in recovery.conf on the backup is:

standby_mode = 'on'
primary_conninfo = 'host=192.168.56.101 port=5432 user=backups password=abc'
restore_command = 'cp /apps/pgsql/9.4/write_ahead_logs/%f %p'
archive_cleanup_command = 'pg_archivecleanup /apps/pgsql/9.4/write_ahead_logs %r'
trigger_file = '/tmp/psql.trigger'

I'm sure there's something important I'm missing but after searching the internet and the docs I can't put my finger on it. Any help would be much appreciated.

Best Answer

Solved it. The issue was that I had Monit running in the background which was picking up the backup database before I had written recovery.conf (the servers are being provisioned by chef).

The solution was to pass the -R flag to pg_basebackup in order to ensure the database was never started outside of recovery mode which caused it to be incompatible with the WALs and streaming replication from the master.

Related Solutions

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages:

http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN

After the rsync, did you really run what you show?:

pgsql -c "select pg_stop_backup();";

Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:

Log: consistent recovery state reached at 0/BF0000B0

and:

Log: streaming replication successfully connected to primary

Did you try connecting to the slave at this point? What happened?

The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:

The only ways I have restarted Postgres is through the service postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands. After I receive this error, I kill all processes and again try to restart the database...

Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:

log_line_prefix = '[%m] %p %q<%u %d %r> '

The recovery.conf script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?

Postgresql – Streaming Replication in PostgreSQL

PostgreSQL replicas never finish recovering. This is by design. Basically a replica is always in "recovering from disaster" mode except that it is using receiving the WAL segments from the master rather than on disk.

So what you are seeing is not cause for concern. If it is not working yet, then you will need to provide a more detailed description of what you are trying to do and what is not working. But as far as you are posting it seems normal.

Best Answer

Related Solutions

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

Postgresql – Streaming Replication in PostgreSQL

Related Question