The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages:
http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN
After the rsync, did you really run what you show?:
pgsql -c "select pg_stop_backup();";
Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:
Log: consistent recovery state reached at 0/BF0000B0
and:
Log: streaming replication successfully connected to primary
Did you try connecting to the slave at this point? What happened?
The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:
The only ways I have restarted Postgres is through the service
postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands.
After I receive this error, I kill all processes and again try to
restart the database...
Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:
log_line_prefix = '[%m] %p %q<%u %d %r> '
The recovery.conf script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?
In order to restore a backup, you need to have the base archive of all the data files, plus a sequence of xlogs. An "incremental backup" then consists of just the further xlogs that continue that sequence. Note that if any xlogs are missing from the sequence, recovery will stop early, at the first gap.
So it's not clear here exactly what you've done, unless you changed the level of detail you're mentioning part way through your list. When you make a copy of more segments that have been put into the archive directory after adding more data, you need to ensure that all the data has been archived: using pg_start_backup and pg_stop_backup usually does this for you, but you don't mention it the second time. You need to at least do a pg_switch_xlog to have the current xlog segment immediately archived.
If you think that recovery is not consuming enough xlog segments, look at the recovery log to see if it tried to take them all. Also have your restore_command record which xlog files it fetched, so you can check afterwards.
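One way to leave such a mark is to have the command that fetches segments append a line to a log file for every segment it is asked for. Below is a minimal sketch in Python, not a tested production script. The function name, the log format, and the idea of calling it from a small wrapper script are all assumptions made for illustration; only the general contract is taken from PostgreSQL (the command receives the segment file name and the destination path, and a nonzero exit status means the file was not available).

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def restore_segment(archive_dir, wal_name, dest_path, log_path):
    """Copy one xlog segment out of the archive and log the attempt.

    Returns 0 if the segment was copied and 1 if it is not in the
    archive; the caller should use this as its exit status.
    All names and paths here are hypothetical.
    """
    source = Path(archive_dir) / wal_name
    found = source.is_file()
    if found:
        shutil.copyfile(source, dest_path)
    # Record every request, including the ones that could not be served,
    # so gaps in the sequence are visible afterwards.
    stamp = datetime.now(timezone.utc).isoformat()
    outcome = "restored" if found else "missing"
    with open(log_path, "a") as log:
        log.write(f"{stamp} {wal_name} {outcome}\n")
    return 0 if found else 1
```

A small wrapper script could pass its two arguments (the file name and destination path supplied by the server) to this function and exit with the returned value. Reading the log afterwards shows exactly which segments recovery asked for and where it stopped.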
Best Answer
"pg_wal" cleans itself up. You should almost never touch pg_wal by hand. If it is not cleaning itself up, you need to figure out why and fix the underlying issue.
One possible reason is that you have a replication slot which is holding it back: either a replica is using a slot and is unable to keep up, or you have a slot which has no replica attached, for example because you destroyed the replica but didn't drop the slot it used to occupy. You can see what slots you have by querying pg_replication_slots, and if necessary drop one with pg_drop_replication_slot, both run on the master. You would look for the slot with the oldest non-NULL value of "restart_lsn".
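Picking out the oldest restart_lsn by eye is error-prone, because LSNs are two hexadecimal numbers separated by a slash and do not sort correctly as plain text. The sketch below shows one way to compare them in Python. It assumes you have already fetched slot_name, active and restart_lsn from pg_replication_slots by whatever client you use; the helper names and the sample rows are made up for illustration.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '0/BF0000B0' to a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def oldest_slot(rows):
    """Return the (slot_name, active, restart_lsn) row with the oldest
    non-NULL restart_lsn, or None if no slot is retaining WAL."""
    candidates = [row for row in rows if row[2] is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda row: lsn_to_int(row[2]))


# Made-up rows, shaped like the result of:
#   SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
rows = [
    ("replica_a", True, "2/1A000028"),
    ("old_replica", False, "0/BF0000B0"),
    ("unused", False, None),
]
print(oldest_slot(rows))
```

If the slot this returns is inactive and belongs to a replica that no longer exists, it is the likely candidate for pg_drop_replication_slot. If it is active, the replica behind it is probably the one failing to keep up.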
Another reason is that you have "archive_mode" turned on, but your "archive_command" is constantly failing or can't keep up. You will see warnings about this in your server log file if it is failing.
"pg_archivecleanup" is used to clean up a WAL archive. "pg_wal" is not the archive, it is the live WAL files. You are lucky you didn't destroy your database by monkeying around in there.