The message "The database system is starting up." does not indicate an error. It is logged at the FATAL level so that it will always make it to the log, regardless of the log_min_messages setting:
http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN
After the rsync, did you really run what you show?:
pgsql -c "select pg_stop_backup();";
Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:
Log: consistent recovery state reached at 0/BF0000B0
and:
Log: streaming replication successfully connected to primary
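As an aside, if pgsql was just a typo for the standard client, the stop-backup call would look something like this (the -U option here is a placeholder I've added, not a value from your setup):

```
psql -U postgres -c "SELECT pg_stop_backup();"
```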
Did you try connecting to the slave at this point? What happened?
The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:
The only ways I have restarted Postgres is through the service
postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands.
After I receive this error, I kill all processes and again try to
restart the database...
Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:
log_line_prefix = '[%m] %p %q<%u %d %r> '
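With that prefix, a session's log lines come out looking something like the following (the timestamp, PID, user, database, and client address shown are purely illustrative):

```
[2013-06-04 10:15:23.123 UTC] 4567 <postgres mydb 192.168.0.5(51234)> LOG:  database system is ready to accept connections
```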
The recovery.conf script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?
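For reference, a minimal streaming-replication recovery.conf on 9.1 usually looks something like this; the host, user, archive path, and trigger file below are placeholders, not values from your setup:

```
standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432 user=replicator'
restore_command = 'cp /path/to/archive/%f %p'   # only if restoring from a WAL archive
trigger_file = '/tmp/promote_standby'
```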
It's the file size that's the problem.
Essentially there's a timeout expiring somewhere, the server logs will tell you where.
When the timeout expires, no more data is produced, and after a while with no data the connection gets closed as per normal.
The last time I had to tackle something like this (which is a long time ago now), I wrote a small program that pulled the data out of my server in chunks using various SQL criteria.
This is absolutely NOT the best way of doing it, but when you have a table that's so big and takes so long that things start to time out, it's one of the only ways I found to make it work.
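That chunked approach can also be sketched with plain psql and key ranges instead of a custom program. Everything here is a placeholder (database, table, key column, chunk boundaries), and the commands are only echoed as a dry run so you can inspect them before removing the echo:

```shell
#!/bin/sh
# Pull a big table out in key-range chunks; all names and ranges are placeholders.
CHUNK=1000000
for start in 0 1000000 2000000; do
  end=$((start + CHUNK))
  # Build the per-chunk \copy statement, then print the command (dry run).
  sql="\\copy (SELECT * FROM big_table WHERE id >= $start AND id < $end) TO 'chunk_$start.csv' WITH CSV"
  echo "psql -d mydb -c \"$sql\""
done
```

Each chunk finishes well inside any timeout, and a failed chunk can be re-run on its own.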
The other approach you have is to use an ETL tool such as Pentaho and do a server to server transfer, again using a custom query to do it a chunk at a time.
If you MUST do it using psql, then try to break it up. Instead of one massive 150 GB file, try doing the transfer a table at a time or similar.
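The table-at-a-time idea can be sketched with pg_dump like so; the host, database, and table list are placeholders, and the commands are echoed as a dry run:

```shell
#!/bin/sh
HOST=db.example.com   # placeholder host
DB=mydb               # placeholder database
for tbl in orders customers events; do   # placeholder table list
  # Remove "echo" to actually run; -Fc writes a compressed custom-format dump.
  echo pg_dump -h "$HOST" -t "$tbl" -Fc -f "${tbl}.dump" "$DB"
done
```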
Also, think about local resources. You might get better traction if, for example, you can dump the file to a server inside Amazon's network where you have file system access, such as an AWS instance. If you have such a resource, set up an at job, dump the DB to that instance (optionally gzipping it in the process), then transfer that file to your PC using a web browser or an FTP connection.
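That dump-nearby-then-fetch idea amounts to something like the following two steps; the hostnames and paths are placeholders, and the commands are echoed rather than executed:

```shell
#!/bin/sh
# Step 1: dump on a host inside Amazon's network, compressing in-stream.
# Step 2: pull the much smaller compressed file down afterwards.
# Drop the "echo" to run for real; all names below are placeholders.
echo 'pg_dump -h db.internal.example mydb | gzip > /data/mydb.sql.gz'
echo 'scp ec2-user@aws-host.example:/data/mydb.sql.gz .'
```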
Tackling the initial problem head on: you're not going to know which approach to take until you can figure out why it's timing out.
Best Answer
Server reboot helped. Privileges are now treated as expected. (Not sure whether restarting only the DB engine would have been enough to fix that, though; I had to reboot the entire machine.)
Thanks @Colin for the tip.