The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages
:
http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN
After the rsync, did you really run what you show?:
pgsql -c "select pg_stop_backup();";
Since there is, so far as I know, no pgsql
executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql
, because otherwise I don't see how the slave would have logged such success messages as:
Log: consistent recovery state reached at 0/BF0000B0
and:
Log: streaming replication successfully connected to primary
Did you try connecting to the slave at this point? What happened?
The "Success. You can now start..." message you mention is generated by initdb
, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:
The only ways I have restarted Postgres is through the service
postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands.
After I receive this error, I kill all processes and again try to
restart the database...
Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:
log_line_prefix = '[%m] %p %q<%u %d %r> '
The recovery.conf
script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?
Munin's formula for "device utilization" is (milliseconds spent doing I/O)/second, which assumes you can't do any I/O in parallel, so I'm not sure that this is a meaningful metric. However, you do clearly have a genuine performance issue here since replication can't stay caught up.
Is there a significant difference in the I/O subsystems between the servers?
The way I would approach this is to compare the I/O subsystems and then if there isn't a glaring difference in architecture, run some I/O benchmarking at a quiet time with a tool like bonnie++ or fio to narrow down the difference in performance.
Note that the slave likely has an equal write workload to the master because it has replay all the writes, plus it may have a considerable read workload (for reading any parts of the database that have to be updated) and might be less efficient for reads than the master because it has less RAM for caching. In this situation I would not necessarily expect the slave to be able to catch up because it's entirely possible for the master to generate work at a rate faster than the slave can consume it.
Best Answer
Community Wiki answer generated from a question comment by a_horse_with_no_name
Create symbolic links on the slave that match the mount points of the tablespaces on the master.