The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages
:
http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN
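For instance (a minimal check in a superuser session), you can confirm that even a fairly restrictive setting leaves FATAL above the logging threshold:
SHOW log_min_messages;            -- 'warning' by default
SET log_min_messages = 'error';   -- FATAL ranks above ERROR, so it would still be logged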
After the rsync, did you really run what you show?
pgsql -c "select pg_stop_backup();";
Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:
Log: consistent recovery state reached at 0/BF0000B0
and:
Log: streaming replication successfully connected to primary
Did you try connecting to the slave at this point? What happened?
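One quick check from the client side, if you can connect at all (pg_is_in_recovery() exists since 9.0), is to ask the slave directly:
SELECT pg_is_in_recovery();  -- returns true while the slave is still in recovery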
The "Success. You can now start..." message you mention is generated by initdb
, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:
The only ways I have restarted Postgres is through the service postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands.
After I receive this error, I kill all processes and again try to restart the database...
Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:
log_line_prefix = '[%m] %p %q<%u %d %r> '
The recovery.conf file looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?
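For comparison, a typical 9.1 standby recovery.conf restores from a dedicated archive directory rather than from either server's active pg_xlog (the hostname, user, and path below are placeholders):
standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432 user=replicator'
restore_command = 'cp /var/lib/pgsql/wal_archive/%f "%p"'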
In the msdb database, there is a table named restorehistory:
declare @DB sysname = 'MyDB';
select * from msdb.dbo.restorehistory where destination_database_name = @DB;
This is a table that people with sufficient privileges can clear out, but if the restore was recent and you don't have a job that clears this table (or the job hasn't run since the restore), you should be able to see the login that performed the restore.
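For example, a minimal query for just the who and when (restore_date, user_name, and restore_type are columns of restorehistory):
declare @DB sysname = 'MyDB';
select rh.restore_date, rh.user_name, rh.restore_type
from msdb.dbo.restorehistory rh
where rh.destination_database_name = @DB
order by rh.restore_date desc;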
EDIT
You can also join it to a couple of other tables (if they have relevant data): backupset and backupmediafamily. If there are records there (with the same retention caveats as before), they'll tell you more about the backup file(s) used to restore:
declare @DB sysname = 'MyDB';
select
    rh.destination_database_name,
    rh.user_name,
    bs.name as backup_set_name,
    bs.user_name as backup_set_username,
    bs.backup_start_date,
    bs.backup_finish_date,
    bs.database_name as backup_set_database_name,
    bs.server_name,
    bs.machine_name,
    bmf.physical_device_name,
    bmf.device_type,
    case bmf.device_type
        when 2 then 'Disk'
        when 5 then 'Tape'
        when 7 then 'Virtual Device'
        when 105 then 'Permanent backup device'
        else 'UNKNOWN'
    end as device_type_desc
from
    msdb.dbo.restorehistory rh
    left outer join msdb.dbo.backupset bs on rh.backup_set_id = bs.backup_set_id
    left outer join msdb.dbo.backupmediafamily bmf on bs.media_set_id = bmf.media_set_id
where
    rh.destination_database_name = @DB;
That way, you could also filter on device type or file location.
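That filter might look like this (the path here is just an illustration):
declare @DB sysname = 'MyDB';
select rh.destination_database_name, rh.user_name, bmf.physical_device_name
from msdb.dbo.restorehistory rh
left outer join msdb.dbo.backupset bs on rh.backup_set_id = bs.backup_set_id
left outer join msdb.dbo.backupmediafamily bmf on bs.media_set_id = bmf.media_set_id
where rh.destination_database_name = @DB
  and bmf.device_type = 2                        -- disk only
  and bmf.physical_device_name like N'D:\Backups\%';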
Best Answer
This answer has two sections: first, what's acceptable to see in the logs after the restore, and second, a few examples of what is not. The first section should be fairly deterministic, while the second is basically a random assortment of whatever happened to us that indicated we had a problem.
Acceptable log output
at the start:
It's important to see that the restoring PostgreSQL knows when it was last up (the log line of the form "database system was interrupted; last known up at ..."). I think that's important because it means recovery is starting from a known checkpoint.
xlog min recovery request ... is past current point
Right at the beginning, a few of these can happen. But according to http://www.postgresql.org/message-id/CAB7nPqTd43hqpuC+M8fo+xkqHv1WtFe_16NUttu1pHcBtZhZmw@mail.gmail.com that is harmless.
FATAL: the database system is starting up
Any number of these can happen. This should actually be harmless; in our case they were the result of automated SELECT 1 ping-like queries that scripts run to check that PostgreSQL is ready.
unexpected pageaddr ... in log file ..., segment ..., offset ...
At the end, there's this, but according to http://www.postgresql.org/message-id/CAGrpgQ-BbXUNErrAtToYhRyUef9_GdUQz1T3CXbpTMLTnuKANQ@mail.gmail.com that's also harmless.
Note that there may be more of the WAL restorations after that point: that would merely mean that you supplied more WAL files via recovery.conf than strictly necessary.
00000002.history: No such file
At the very end of the WAL unroll process there's this; it is apparently/hopefully irrelevant, because that's where the restored database (clone) starts a new life (timeline).
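If you want to confirm the clone really did move to a new timeline, one way (a sketch; pg_xlogfile_name() and pg_current_xlog_location() exist in 9.1, and the first eight hex digits of the returned file name are the timeline ID):
SELECT pg_xlogfile_name(pg_current_xlog_location());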
Unacceptable log output
at the start:
This is critical: it means that the backup process did not start at the right time, after a pg_start_backup(...) checkpoint, but rather that the database was working normally and was at some random point, which means that this restore is more akin to restoring a crashed database.
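For contrast, a correct file-level base backup is bracketed like this (a minimal sketch; the label is arbitrary):
SELECT pg_start_backup('base_backup_label');  -- forces a checkpoint and marks the backup start
-- ... file-level copy of the data directory happens here ...
SELECT pg_stop_backup();                      -- marks the end, so recovery knows where consistency begins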
missing chunk in pg_toast...
This indicates that the restore wasn't right. As a quick fix, we tried the recipe from http://postgresql.nabble.com/select-table-indicate-missing-chunk-number-0-for-toast-value-96635-in-pg-toast-2619-td5682176.html
This could sometimes get the table back into a working state, but sometimes it would not. After that we poked at it some more, and thought we found it's just pg_statistic, which is disposable:
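A sketch of that route (assuming the error really points at pg_toast_2619, which is the TOAST table of pg_statistic, whose contents are derived data):
DELETE FROM pg_statistic;  -- throw away the corrupt derived statistics
ANALYZE;                   -- rebuild them from the actual table data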
right sibling's left-link doesn't match
We tried to quickly bypass this by doing:
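The usual quick bypass for a broken btree like this is to rebuild the index named in the error message (the index name below is a placeholder):
REINDEX INDEX index_named_in_the_error;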
could not read block in file ...
This was obviously a bummer; we couldn't quickly hack our way around it. That was a good indicator that too much data was getting "lost".
duplicate key value violates unique constraint "pg_type_typname_nsp_index"
This was another indicator that the restore was broken.
The quick hack for this was to move the sequence position:
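Along these lines (a sketch; the sequence name and target value are placeholders for whatever the collision traced back to):
SELECT setval('the_affected_sequence', 12345);  -- jump the sequence past the colliding value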