Postgresql streaming replication and wal archive

postgresql

We have postgresql 9.1 configured for streaming replication with a WAL archive, and this works great. We would like to use a shared WAL archive so we can restore from periodic base backups. The postgresql docs refer to this as a "long-term WAL archive area."

However, we archive to a shared folder (using archive_command), and after a failover, the new master tries to archive the same WAL segment filename as the old master. The contents are different. What is the right way to handle this?

We have tried to allow the new master to overwrite the old file, but then when we test restore from the base backup with a configured restore_command that restores from the shared WAL archive, we get various errors when the restoring server hits the overwritten log files. (errors like "record with zero length at 0/E5003128") So it looks like overwriting is the wrong approach. What is the right approach?

In a restore situation, there is no master and no slave. There is a fresh clean database that is restoring from a basebackup. It also has a restore_command. How are the archive_command and restore_command supposed to work together to let the restore work? (i.e. where does the archive command write after a failover, and where does the restore_command read from?)

A bit more information so this is perfectly clear:

Let's say we have 4 servers. A, B, X, and R. A and B are identical, and are a master/slave pair using streaming replication. They have the same configuration, except as far as recovery.conf v. recovery.done on the slave. X is a separate reliable host where we store backups and archived WAL files. archive_command (on either A or B, whoever is master) archives WAL files to a shared folder on X. restore_command (on either A or B, whoever is slave) restores from that shared folder on X. Periodically, we take a basebackup of the master and store it on X. This works, with one caveat below.

In a disaster situation (or for testing) we want to restore a full copy of the database onto R. We restore one of the basebackups from X on R. R now has the restore_command to read from the shared folder on X. It can easily restore all the WAL files made from the same source as the basebackup, but it fails as soon as it hits a WAL file from the new master after a failover. (see Caveat).

Caveat: if archive_command does as the specs says, and refuses to overwrite a file with the same name, then the new master will fail to archive any WAL file segments. On promotion, it tries to archive everything it has locally, including segments from the old timeline. In fact, it has segments that were never archived by the old master, because they were streamed before the xlog segment was closed. So currently, we allow the archive command to overwrite archived WAL files in this situation. It's these WAL files that fail to restore in a recovery. We know this is wrong, somehow, but don't know what the right behavior should be. With this behavior, though, failover and later fail back work.

Best Answer

Further testing revealed that the different WAL files were mostly those that had been rotated but not yet written to. The new slave would try to archive these (possibly fixed in 9.2) but they wouldn't be used for recovery.

The other problem is the overwriting. We disabled the overwrite (instead storing with a new filename so we at least keep the WAL just in case) and everything works... both failover and later restore.

NOTE: my earlier answer here that was completely wrong. See the edit log if you need to see how wrong.