Postgresql – Problems restoring backup from base backup + WAL with WAL-E

linuxpostgresqlpostgresql-9.1write-ahead-logging

I'm trying to do nightly restores of the main database on a separate node. I want to do weekly full restores and nightly restores by consuming the latest WAL files.

On the master I have

wal_level = hot_standby
archive_command = 'wal-e --s3-prefix=s3://... wal-push %p'

and I've done a full backup with:

wal-e --s3-prefix s3://... backup-push <pg data>

On the backup node I have done:

PGDATA=.. initdb
wal-e backup-fetch $PGDATA LATEST

which works and I'm able to connect and select data. Nice.

So to get it to consume the latest WAL archives I added a $PGDATA/recovery.conf:

restore_command = 'wal-e --s3-prefix ... wal-fetch "%f" "%p"'
standby_mode = 'off'

When start the server I expected it to go ahead and look for WAL segments to consume before starting in non-recovery mode, but I'm getting this:

LOG:  database system was interrupted; last known up at 2015-03-21 12:20:50 CET
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2015-03-21T12:16:38.820461-00 pid=23630 action=wal-fetch key=s3://.../wal_005/00000005.history.lzo prefix=...
/ seg=00000005.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://.../wal_005/00000005.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2015-03-21T12:16:39.829223-00 pid=23630
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2015-03-21T12:16:39.830287-00 pid=23630 action=wal-fetch key=s3://.../wal_005/00000005.history.lzo prefix=.../ seg=00000005.history state=complete
LOG:  starting archive recovery
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2015-03-21T12:16:39.900355-00 pid=23638 action=wal-fetch key=s3://.../wal_005/00000005000000050000009D.lzo prefix=.../ seg=00000005000000050000009D state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://.../wal_005/00000005000000050000009D.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2015-03-21T12:16:40.896215-00 pid=23638
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2015-03-21T12:16:40.899320-00 pid=23638 action=wal-fetch key=s3://.../wal_005/00000005000000050000009D.lzo prefix=.../ seg=00000005000000050000009D state=complete
WARNING:  WAL was generated with wal_level=minimal, data may be missing
HINT:  This happens if you temporarily set wal_level=minimal without taking a new base backup.
LOG:  consistent recovery state reached at 5/9D009DA8
LOG:  record with zero length at 5/9D009DA8
LOG:  redo is not required

So the questions are;

is my approach fundamentally wrong? Is there an alternative way to get this to work?

PostgreSQL is claiming I'm having wal_level=minimal, how can I check the level of the produced WAL files? On the main server I get this when checking it's actual settings:

select * from pg_settings where name='wal_level';
-[ RECORD 1 ]------------------------------------------------
name       | wal_level
setting    | hot_standby
unit       |
category   | Write-Ahead Log / Settings
short_desc | Set the level of information written to the WAL.
extra_desc |
context    | postmaster
vartype    | enum
source     | configuration file
min_val    |
max_val    |
enumvals   | {minimal,archive,hot_standby}
boot_val   | minimal
reset_val  | hot_standby
sourcefile | /.../postgresql.conf
sourceline | 155

EDIT: adding details on archive_command on master

Best Answer

wal-e backup push

will NOT push any wal files to s3.

So this backup will be mostly unrecoverable.

Use

archive_command = '... wal-e wal-push %p'

on master to make this backup usable.

Related Solutions

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages:

http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN

After the rsync, did you really run what you show?:

pgsql -c "select pg_stop_backup();";

Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:

Log: consistent recovery state reached at 0/BF0000B0

and:

Log: streaming replication successfully connected to primary

Did you try connecting to the slave at this point? What happened?

The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:

The only ways I have restarted Postgres is through the service postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands. After I receive this error, I kill all processes and again try to restart the database...

Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:

log_line_prefix = '[%m] %p %q<%u %d %r> '

The recovery.conf script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?

Postgresql – Streaming Replication in PostgreSQL

PostgreSQL replicas never finish recovering. This is by design. Basically a replica is always in "recovering from disaster" mode except that it is using receiving the WAL segments from the master rather than on disk.

So what you are seeing is not cause for concern. If it is not working yet, then you will need to provide a more detailed description of what you are trying to do and what is not working. But as far as you are posting it seems normal.

Best Answer

Related Solutions

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

Postgresql – Streaming Replication in PostgreSQL

Related Question