PostgreSQL – how long can your replica be offline when using repmgr with Barman?

barman · postgresql · replication · repmgr

I'm setting up some PostgreSQL database EC2 instances that I'd like to use for load balancing. The application I'm running has some very expensive and very unusual queries, so CPU usage is a concern. While a single instance is fine most of the time, I'd like to be able to quickly spin up some read replicas when we're expecting to process a lot of transactions.

The issue is that it could be days or weeks between times we need to bring up these machines. Since we're using repmgr with Barman, it's very quick to clone a server. But ideally we'd just like to start/stop instances as needed with little thought/overhead.

My question is: when a replica comes back online after being offline for a while, and the WALs on the primary have long since vanished, is repmgr on the replicas smart enough to know to get the backup data from Barman? I would have initially thought no, except I had a replica offline for a week, brought it online, and when I checked the database it was in sync with the primary. pg_wal on the primary only held about 2 days of WAL.

I do have restore_command='/usr/bin/barman-wal-restore barman node1 %f %p' but I thought that was more for initial cloning or recovery.
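For reference, the relevant settings live together on the replica in postgresql.auto.conf (PostgreSQL 12+; older versions use recovery.conf). A minimal sketch, using the server names from the question; the `primary_conninfo` user and `application_name` values are assumptions for illustration:

```
# postgresql.auto.conf on the replica (PostgreSQL 12+)
primary_conninfo = 'host=node1 user=repmgr application_name=node2'   # streaming source
restore_command = '/usr/bin/barman-wal-restore barman node1 %f %p'   # archive fallback
```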

Best Answer

If restore_command is set to restore from an archive source (in this case Barman), the replica will attempt to fetch any WAL not available on the primary (or upstream, in the case of cascading replication) from that source. Note that restore_command is a PostgreSQL setting, so it is PostgreSQL itself doing this, regardless of repmgr. So in this case, as long as Barman has the WALs for the period the replica was offline, it will catch up.

You'll need to check the Barman configuration to see how long it is configured to retain backups and WAL, since that limits how long a replica can safely be offline. Execute barman show-server $SERVER (probably node1 here) to list the active configuration (it might take a few seconds to run) and look for retention_policy and wal_retention_policy.
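If you want to check this programmatically, here's a minimal sketch in Python that pulls those two settings out of captured barman show-server output. The sample output format is an assumption based on Barman's typical `key: value` listing; verify against your Barman version:

```python
import re

def retention_settings(show_server_output: str) -> dict:
    """Extract retention_policy and wal_retention_policy from
    `barman show-server` output (lines look like '\tkey: value')."""
    settings = {}
    for line in show_server_output.splitlines():
        m = re.match(r"\s*(retention_policy|wal_retention_policy)\s*:\s*(.+)", line)
        if m:
            settings[m.group(1)] = m.group(2).strip()
    return settings

# Example, with output captured from e.g.:
#   barman show-server node1 > node1.txt
sample = """\
Server node1:
\tactive: True
\tretention_policy: RECOVERY WINDOW OF 7 DAYS
\twal_retention_policy: main
"""
print(retention_settings(sample))
```

With a `RECOVERY WINDOW OF 7 DAYS` policy, WAL older than the oldest backup needed to satisfy that window gets pruned, so that is roughly the ceiling on how long a replica can stay down.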

I do have restore_command='/usr/bin/barman-wal-restore barman node1 %f %p' but I thought that was more for initial cloning or recovery.

If the initial cloning is from Barman, it essentially rsyncs the latest full backup from the Barman server. The restore_command will then fetch from Barman any WAL not available on the primary/upstream, and it is left in place as a fallback source of WAL for situations (like the one described here) where streaming replication is interrupted for whatever reason.
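For completeness, this is roughly what Barman-aware cloning looks like on the repmgr side, in repmgr.conf on the replica. This is a sketch based on repmgr's Barman integration (repmgr 4+); the paths, node details, and the `barman@barmanhost` SSH destination are assumptions, so check the exact parameter names against your repmgr version's documentation:

```
# repmgr.conf on the replica (sketch)
node_id=2
node_name='node2'
conninfo='host=node2 user=repmgr dbname=repmgr'
data_directory='/var/lib/postgresql/data'

# Clone from Barman instead of pg_basebackup against the primary
barman_host='barman@barmanhost'   # SSH user/host of the Barman server (assumption)
barman_server='node1'             # server name as configured in Barman
restore_command='/usr/bin/barman-wal-restore barman node1 %f %p'
```

With this in place, `repmgr standby clone` pulls the base backup from Barman rather than loading the primary, which fits the "spin replicas up on demand" goal in the question.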