PostgreSQL WAL Archival Best Practices

backup, disaster-recovery, postgresql, write-ahead-logging

The PostgreSQL documentation gives an example archive_command of

archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'  # Unix

but adds the disclaimer:

This is an example, not a recommendation

As someone who's about to set up a PostgreSQL database, I'm wondering what the best practice is for handling WAL archival. Forgive me if this question has already been beaten to death, but my Stack Exchange search-fu is failing me. There are a few recommendations for using pgBarman. Is it still a good direction to go?

I'm coming from a setup where MSSQL backups were handled by IT: a full backup taken daily in the morning, plus hourly transaction log backups. What would be an equivalent setup in PostgreSQL to allow point-in-time recovery for the past week or two?

Best Answer

The main problem with the example from the docs is that it doesn't force the copy to be synced to disk before reporting success. That means the database might be allowed to delete the original WAL file, and then a poorly timed OS crash can leave the copy never safely reaching disk, so it is not truly archived.
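
One minimal way to get that sync behaviour, as a sketch only (it assumes a GNU coreutils sync new enough to accept file and directory arguments, and reuses the documentation's placeholder path), is to flush both the copied file and its directory before reporting success:

archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f && sync /mnt/server/archivedir/%f && sync /mnt/server/archivedir'

This still inherits the retry problem described next.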

Another problem is that if the copy fails (or succeeds but is reported as failed, because something prevented the success message from being heard) after first creating a file with the correct name, then when archiving is retried the test command will cause it to keep failing forever, until someone manually intervenes. If you remove the test, then if you accidentally configure two servers to archive to the same spot, they can silently overwrite each other's files. So a complete solution will only allow WAL files that belong to the proper base backup to be saved in the first place, and will allow a file to be saved more than once as long as the new version is identical to, or an extension of, the earlier one. Of course, you generally want to archive to a different machine anyway, so the example archive command's local cp is often just wrong in that regard too.
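
For illustration only, a rough wrapper along those lines might look like the sketch below (the script name and archive path are made up, it relies on GNU coreutils sync accepting file arguments, and it makes no attempt to check which base backup a segment belongs to). It would be wired up as archive_command = '/usr/local/bin/archive_wal.sh %p %f':

#!/bin/sh
# Hypothetical archive_wal.sh -- a sketch, not a hardened solution.
src="$1"                              # full path of the WAL segment (%p)
dst="/mnt/server/archivedir/$2"       # destination, keyed by file name (%f)

if [ -f "$dst" ]; then
    # Already archived: succeed only if the existing copy is byte-identical,
    # so a retry after a lost success reply does not wedge archiving forever,
    # while a different file under the same name (another server) still fails.
    cmp -s "$src" "$dst"
    exit $?
fi

cp "$src" "$dst.tmp"          || exit 1
sync "$dst.tmp"               || exit 1   # flush the file data to stable storage
mv "$dst.tmp" "$dst"          || exit 1   # then publish it under the final name
sync /mnt/server/archivedir               # and flush the directory entry as well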

pgbarman has its flaws and limitations, but it does get around the problems mentioned above. pgbackrest is another option (one I have not played around with). If you use log shipping with pgbarman and your server doesn't see much activity, you would need to set archive_timeout to one hour, which can mean a lot of WAL files with little useful data in each hanging around. You can get around this by using streaming mode instead.
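
For the log-shipping case, the hourly cap is just a postgresql.conf setting; the value below simply mirrors the hourly schedule the question mentions:

archive_timeout = '1h'   # force a WAL segment switch at least once an hour, even when mostly idle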

Another option is to use pg_receivewal, or to use a streaming replica with a replication slot and with archive_mode set to 'always'; then the replica will create the archive for you. You still have the problem that the archive_command on the streaming replica needs to force the copy of the file to be synced before reporting success, if you want it to be bulletproof.
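
As a sketch of those two routes (host, user, slot name, and directories are placeholders, and the wrapper script is the hypothetical one from above):

# one-time step: create a physical replication slot so the primary keeps WAL until it has been fetched
pg_receivewal --slot=wal_archive --create-slot -h db-primary.example.com -U replicator

# then keep this running (e.g. under a service manager) to stream WAL into an archive directory
pg_receivewal --slot=wal_archive -D /mnt/server/walarchive -h db-primary.example.com -U replicator

# or, on a streaming replica, let the standby do the archiving (postgresql.conf on the replica):
archive_mode = always
archive_command = '/usr/local/bin/archive_wal.sh %p %f'   # still needs to sync before reporting success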