PostgreSQL – Data corruption on base backups created with wal-e


I have streaming replication set up on PostgreSQL 9.1 with two machines on AWS EC2. The master has a cron job that takes a daily base backup with the wal-e tool and uploads it to Amazon S3. The WAL files are also archived to S3 with the same tool.

At some point I had to replace the slave instance, so I ran a wal-e based script that loads the base backup from S3 and writes it to $PGDATA. When I then started the database, I got the following errors:

...
Apr 10 10:15:29 dbc postgres[25865]: [12-1] 2014-04-10 10:15:29 UTC LOG: restored log file "00000005000000B800000089" from archive
Apr 10 10:15:31 dbc postgres[25865]: [13-1] 2014-04-10 10:15:31 UTC LOG: restored log file "00000005000000B80000008A" from archive
Apr 10 10:15:33 dbc postgres[25865]: [14-1] 2014-04-10 10:15:33 UTC LOG: restored log file "00000005000000B80000008B" from archive
Apr 10 10:15:47 dbc postgres[25865]: [15-1] 2014-04-10 10:15:47 UTC LOG: restored log file "00000005000000B80000008C" from archive
Apr 10 10:15:47 dbc postgres[25865]: [16-1] 2014-04-10 10:15:47 UTC FATAL: could not access status of transaction 0
Apr 10 10:15:47 dbc postgres[25865]: [16-2] 2014-04-10 10:15:47 UTC DETAIL: Could not open file "pg_subtrans/2214": No such file or directory.
Apr 10 10:15:47 dbc postgres[25865]: [16-3] 2014-04-10 10:15:47 UTC CONTEXT: xlog redo hot_update: rel 1663/24577/5279107; tid 22/32; new 22/84
Apr 10 10:15:47 dbc postgres[25863]: [1-1] 2014-04-10 10:15:47 UTC LOG: startup process (PID 25865) exited with exit code 1
Apr 10 10:15:47 dbc postgres[25863]: [2-1] 2014-04-10 10:15:47 UTC LOG: terminating any other active server processes

The server does not start up.

Does this mean that my base backup is corrupted? Or does it mean that some WAL files are corrupted?

The master server is running Postgres 9.1.12 (Debian Wheezy), but the newly launched slave is running Postgres 9.1.13 (Debian Wheezy). This happened because, when I installed the new machine, only the newer minor release of Postgres 9.1 was available in the repositories. My understanding is that this should not be the cause of the problem.

Interestingly, if I rsync the master's data directory to the slave and then start the slave with streaming replication set up, the slave comes up fine.

Because of this, it is not clear whether the corruption is in my DB data directory (which is on an AWS EBS drive) or whether something is wrong with the backup tool, wal-e. Any help is appreciated.

Best Answer

I've experienced this problem with wal-e versions up to 0.9.2. Manually creating the pg_subtrans directory inside $PGDATA seems to have got replication working again.
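
For anyone who wants to script that workaround, here is a minimal sketch in Python. It rests on some assumptions: it is run as the OS user that owns the data directory (usually postgres), the server is stopped, and the only thing missing from the restored backup is the pg_subtrans directory itself. The function name `ensure_subdirs` is my own and is not part of wal-e or PostgreSQL. The demo at the bottom runs against a temporary directory rather than a real $PGDATA.

```python
import os
import tempfile


def ensure_subdirs(pgdata, names=("pg_subtrans",)):
    """Create any missing subdirectories under pgdata.

    Hypothetical helper, not part of wal-e or PostgreSQL.
    Returns the list of directory names that had to be created.
    """
    created = []
    for name in names:
        path = os.path.join(pgdata, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            # Restrict access to the owner, as for the data directory itself.
            os.chmod(path, 0o700)
            created.append(name)
    return created


if __name__ == "__main__":
    # Demo against a throwaway directory standing in for $PGDATA.
    with tempfile.TemporaryDirectory() as fake_pgdata:
        print(ensure_subdirs(fake_pgdata))  # first call creates the directory
        print(ensure_subdirs(fake_pgdata))  # second call finds nothing missing
```

On a real slave you would call `ensure_subdirs(os.environ["PGDATA"])` after the wal-e restore finishes and before starting Postgres. This only recreates the empty directory. It does not tell you whether anything else is missing from the backup, so check the startup log again afterwards.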