Postgresql – Replica is stuck waiting for

postgresql replication

I am running two PostgreSQL 9.2.3 instances: one is the master and the other is the slave.

Everything worked for a while: I could make changes on the master and see them on the slave. Now the slave is no longer receiving changes.

1000     25069  0.2  0.1 131616 11280 pts/0    S    01:02   0:00 /usr/local/pgsql/bin/postmaster -D /mnt/pgdb/v9.2.3 -i
1000     25074  0.0  0.0 131724  1440 ?        Ss   01:02   0:00 postgres: startup process   waiting for 000000040000000700000083

I've tried countless things, including completely blowing away the slave, running initdb, rsyncing the master back to the slave, and launching the slave, only to find it still stuck in this same damn state.

What the hell am I doing wrong here?

Slave's recovery.conf looks as follows:

standby_mode = 'on'
primary_conninfo = 'host=172.16.0.14'
trigger_file = '/tmp/pgfailover'
restore_command = 'cd .'

I've also tried the following on the master to no avail:

$PGDIR/psql -U $UNAME -d postgres -c "select pg_start_backup('clone',true);"
rsync -av --exclude pg_xlog --exclude postgresql.conf /mnt/pgdb/current/* 172.16.0.8:/mnt/pgdb/current/
$PGDIR/psql -U $UNAME -d postgres -c "select pg_stop_backup();"

UPDATE – Including relevant information pertaining to the replication problem.

postgresql.conf on the Master node

wal_level = hot_standby
archive_mode = on
archive_command = 'cd .'

max_wal_senders = 3
wal_keep_segments = 5000
hot_standby = on
hot_standby_feedback = on

The Slave Node's postgresql.conf has these settings:

wal_level = hot_standby
archive_mode = on
archive_command = 'cd .'

max_wal_senders = 3
hot_standby = on
hot_standby_feedback = on

Best Answer

It looks like the standby doesn't successfully connect to the master.
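One way to confirm this is to ask both servers directly. The following is a minimal sketch, not a definitive procedure: the function names are hypothetical, the host and user arguments are placeholders, and it assumes psql is on the PATH and that the standby already accepts read-only connections (otherwise the second query fails with "the database system is starting up").

```shell
#!/bin/sh
# Hypothetical helpers; the host and user arguments are placeholders.

# On the master: pg_stat_replication has one row per connected standby.
# An empty result means no standby is currently streaming.
master_replication_status() {
    psql -h "$1" -U "$2" -d postgres -At \
        -c "select client_addr, state, sent_location, replay_location from pg_stat_replication;"
}

# On the standby: whether it is in recovery, plus the last WAL location
# it has received over streaming and the last one it has replayed.
standby_replay_status() {
    psql -h "$1" -U "$2" -d postgres -At \
        -c "select pg_is_in_recovery(), pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"
}

# Example (addresses taken from the question, user name assumed):
#   master_replication_status 172.16.0.14 postgres
#   standby_replay_status 172.16.0.8 postgres
```

If the first query returns no rows while the standby's startup process is waiting for a segment, the standby is probably not reaching the master at all, which points at primary_conninfo, pg_hba.conf, or the network rather than at WAL retention.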

For streaming replication to work, the master must still have, in its pg_xlog directory, every WAL segment the standby needs. Since you're not using log shipping, it has to keep all the WAL generated from the moment you started the backup until the standby has started and caught up.

You can control how many WAL segments the master retains by setting wal_keep_segments to a high enough value.

You should also make sure that max_wal_senders is at least the number of standby servers you have.
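To judge what "high enough" costs in disk space, a rough estimate helps. This sketch assumes the default WAL segment size of 16 MB (a build-time option, so it may differ on a custom build); the function name is hypothetical.

```shell
#!/bin/sh
# Rough upper bound, in MB, on the WAL kept in pg_xlog for a given
# wal_keep_segments value. The second argument is the segment size in MB
# and defaults to 16, the PostgreSQL default.
wal_retention_mb() {
    segments=$1
    segment_mb=${2:-16}
    echo $(( segments * segment_mb ))
}

wal_retention_mb 5000   # prints 80000
```

With wal_keep_segments = 5000, as in the question, the master may hold on to roughly 80000 MB of WAL, so the volume holding pg_xlog needs that much headroom.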

However, it is always best to have WAL archive restore configured as a fallback for when streaming replication fails for any reason. You can achieve that by setting archive_command in postgresql.conf and restore_command in recovery.conf.

For instance, like this, assuming that /pgsql/backups/archive_logs is mounted as an NFS share accessible on both servers:

archive_command = 'cp %p /pgsql/backups/archive_logs/%f'
restore_command = 'cp /pgsql/backups/archive_logs/%f %p'
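A plain cp will silently overwrite a segment that is already in the archive. A slightly more defensive variant, sketched here under the assumption that you wrap it in a script of your own and call that script from archive_command with %p and %f, refuses to overwrite and only exposes the file under its final name once the copy is complete. The function name and argument order are hypothetical.

```shell
#!/bin/sh
# archive_wal SRC DEST_DIR NAME
# Copies one WAL segment into the archive directory.
# Returns non-zero if the segment is already archived, so PostgreSQL
# treats the attempt as failed instead of overwriting existing data.
archive_wal() {
    src=$1
    dest_dir=$2
    name=$3

    if [ -e "$dest_dir/$name" ]; then
        echo "archive_wal: $dest_dir/$name already exists" >&2
        return 1
    fi

    # Copy under a temporary name, then rename, so a half-written file
    # never appears under the name restore_command looks for.
    cp "$src" "$dest_dir/$name.tmp" && mv "$dest_dir/$name.tmp" "$dest_dir/$name"
}

# Example: archive_wal "$1" /pgsql/backups/archive_logs "$2"
# where $1 and $2 receive %p and %f from archive_command.
```

The non-zero exit status matters: PostgreSQL only recycles a WAL segment after archive_command reports success, and it retries on failure.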

Related Solutions

Postgresql – Automating failover in PostgreSQL 9.1

Check out repmgr:

repmgr is a set of open source tools that helps DBAs and system administrators manage a cluster of PostgreSQL databases.

By taking advantage of the Hot Standby capability introduced in PostgreSQL 9, repmgr greatly simplifies the process of setting up and managing databases with high availability and scalability requirements.

repmgr simplifies administration and daily management, enhances productivity and reduces the overall costs of a PostgreSQL cluster by:

monitoring the replication process; allowing DBAs to issue high availability operations such as switch-overs and fail-overs.

It consists of two tools:

repmgr: command program that performs tasks on your cluster and then exits
repmgrd: management and monitoring daemon that watches the cluster and can automate remote actions.

For automatic failover, repmgrd does the trick and, unlike pgPool, is not a SPOF in your network. However, it is still important to monitor all daemons and bring them back up after a failure.

Version 2.0 is about to be released, including RPMs.

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages:

http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN

After the rsync, did you really run what you show?

pgsql -c "select pg_stop_backup();";

Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:

Log: consistent recovery state reached at 0/BF0000B0

and:

Log: streaming replication successfully connected to primary

Did you try connecting to the slave at this point? What happened?

The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:

The only ways I have restarted Postgres is through the service postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands. After I receive this error, I kill all processes and again try to restart the database...

Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:

log_line_prefix = '[%m] %p %q<%u %d %r> '

The recovery.conf script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?
