PostgreSQL 9.1 Hot Backup Error: the database system is starting up

backuppostgresql

I have been working on a hot backup for Postgres 9.1 for awhile and have run into a consistent issue. After restarting Postgres on the slave server the pgstartup log file and the daily log file under the pg_log directory reads with no errors. However, when I try to enter into the database using the psql command, I get the error:

FATAL: the database system is starting up.

The recovery.conf file also does not turn to recovery.done. I have extensively researched this error and consistently find the same response: the database has not been cleanly shut down before I tried to restart Postgres. The only ways I have restarted Postgres is through the service postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands. After I receive this error, I kill all processes and again try to restart the database and still receive the same error. I am at a loss of where to go from here and how to fix this issue. Below is the exact process that I have done to complete the hot backup.

Master Server Configurations:

pg_hba.conf, added the line:

host replication postgres IPAddressOfSlaveServer trust

postgresql.conf:

wal_level = hot_standby
max_wal_senders = 5
listen_address = '*'
port = 5432
max_wal_senders = 5
wal_keep_segments = 32

Slave Server Configurations:

postgresql.conf:

hot_standby = on

recovery.conf:

standby_mode = on
primary_conninfo = host=IPAddressOfMasterServer
port = 5432
user = postgres
restore_command = 'cp /var/lib/pgsql/9.1/data/pg_xlog/%f "%p"'

After configuring both servers

I change to the postgres user on the master server and run the commands:

psql -c "Select pg_start_backup('label',true);";
rsync -a -v -e ssh /var/lib/pgsql/9.1/data slave:/var/lib/pgsql/9.1/data \
        --exclude postmaster.pid
pgsql -c "select pg_stop_backup();";

After syncing database with slave server

I restart the slave server and the start up does not fail. The pgstartup.log reads:

Success. You can now start the database server using:

    /usr/pgsql-9.1/bin/postgres -D /var/lib/pgsql/9.1/data
or
    /usr/pgsql/9.1/bin/pg_ctl -D /var/lib/pgsql/9.1/data -l logfile start

the current day log file, postgresql-Thu.log, reads:

Log: shutting down
Log: Database system is shut down
Log: database system was shut down in recovery at 2012-4-10
Log: entering standby mode
Log: restored log file "logFileName" from archive
Log: consistent recovery state reached at 0/BF0000B0
Log: redo starts at 0/BF000020
Log: restored log file "logFileName" from archive
Log: unexpected pageaddr 0/85000000 in log file 0, segment 192, offset 0
Log: unexpected pageaddr 0/85000000 in log file 0, segment 192, offset 0
Log: streaming replication successfully connected to primary

I researched unexpected pageaddr and from the postgres archives, it is my understanding that it is quite normal and one of the expected ways to detect end-of-WAL.

Any advice would be greatly appreciated.

Best Answer

The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages:

http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN

After the rsync, did you really run what you show?:

pgsql -c "select pg_stop_backup();";

Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:

Log: consistent recovery state reached at 0/BF0000B0

and:

Log: streaming replication successfully connected to primary

Did you try connecting to the slave at this point? What happened?

The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:

The only ways I have restarted Postgres is through the service postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands. After I receive this error, I kill all processes and again try to restart the database...

Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:

log_line_prefix = '[%m] %p %q<%u %d %r> '

The recovery.conf script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?