PostgreSQL – Safely shutting down seemingly misconfigured (but working) PostgreSQL replication/archiving

barman · postgresql · replication

I'm helping administer two PostgreSQL servers (one primary, one replica) plus a separate server running Barman for WAL archiving. I was not originally involved in setting up the replication and archiving. The configuration is rather old, some of it dating back to 2015 and PostgreSQL 9.3, although we're now running 9.5. The history of configuration changes is sadly not documented. The person who built the setup is still around, but this was their first time setting up replication as well.

Some time ago we ran into a mysterious data corruption issue on our primary server after some storage system troubles at our VPS provider. The corruption was present even when restoring a Barman backup to a fresh server, and it was preventing us from cleanly upgrading to PostgreSQL 12. We did, however, determine that we could cleanly dump the actual production database from the main cluster and restore it into a newly created recovery cluster.

We decided to shut down our replication + Barman archiving setup and start them over from scratch as well, and in researching how to do this I ran into some interesting configuration issues. tl;dr: the current setup is working, but not how I thought it was, and now I need guidance on how to cleanly shut it off.

The first clue that something was off was when I was looking at what replication slots were configured on the primary server with select * from pg_replication_slots. This produced no results, to my confusion. My assumption (based on the description of the person who built the setup) was that we were using slots at least for Barman, and I thought we would be doing so for the replication as well.
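
In case it's useful, these are the queries I've been running on the primary to inspect the state (9.5 catalog names, nothing specific to our setup):

-- Any replication slots at all? (this is the query that returned zero rows)
SELECT slot_name, slot_type, active FROM pg_replication_slots;

-- Connected WAL senders, i.e. whoever is currently streaming from us
SELECT pid, application_name, state, sent_location, replay_location
FROM pg_stat_replication;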

Before I go further, here are the relevant settings from the three servers, with sensitive details altered:

# Primary server's postgresql.conf
wal_level = hot_standby
max_wal_senders = 4

archive_mode = on
archive_command = 'rsync -a %p companyuser@backup-server:/backup/thingamabob/incoming/%f'
max_replication_slots = 2

# Replica server's postgresql.conf
wal_level = hot_standby
hot_standby = on

# Replica server's recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=primary-server user=postgres'
trigger_file = '/tmp/trigger_file0'

# /home/companyuser/.barman.conf on the backup server
[barman]
barman_home = /backup
barman_user = companyuser
log_file = /backup/barman.log
minimum_redundancy = 1

[thingamabob]
description = "Thingamabob"
archiver = on
# 5436 is the SSH tunneled port to our primary DB server, which has
# a 'barman' user
conninfo = host=localhost port=5436 user=barman dbname=postgres
backup_method = postgres
slot_name = backup
retention_policy = RECOVERY WINDOW OF 4 DAYS
retention_policy_mode = auto

Let's look at the Barman WAL archiving side first, which I think I've figured out. archive_mode and archive_command are rsyncing the completed WAL segments from the primary server to the backup server, and Barman has the requisite control connection to the primary server. Apparently due to a misunderstanding, .barman.conf has slot_name = backup set, even though we're using traditional WAL archiving only, and slot_name is used by the more recent WAL streaming functionality of Barman. The dead giveaway here is the lack of the streaming_archiver and streaming_conninfo settings. So, the setting is useless, but fortunately harmless in this setup.
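
For comparison, and only as a sketch based on my reading of the Barman documentation (none of this is in our actual config; the conninfo just mirrors our existing one), a streaming-based setup is what would actually make use of slot_name, roughly like this in the server section:

# Hypothetical streaming configuration for [thingamabob]; NOT part of our current config
streaming_archiver = on
streaming_conninfo = host=localhost port=5436 user=barman dbname=postgres
slot_name = backup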

What I don't really understand is the replication setup. We're not using slots here (recovery.conf doesn't have the primary_slot_name setting), and the documentation for the standby_mode and primary_conninfo settings doesn't make it clear to me how the replication manages to work correctly.
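
For contrast, my understanding is that a slot-based physical replication setup would involve something like the following (a sketch only; the slot name is made up and none of this is in our config):

-- On the primary, create a physical replication slot (available since 9.4):
SELECT pg_create_physical_replication_slot('standby_slot');

# ...and in the replica's recovery.conf, point at it:
primary_slot_name = 'standby_slot'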

Finally, some concrete questions based on the information presented so far:

  1. We're not using replication slots, nor wal_keep_segments, nor restore_command. My assumption is that our replication setup is working through sheer luck: the replica server has not had significant downtime, so the amount of WAL the primary server retains by default has always been enough, and replication has never failed because the primary removed an old WAL file that the replica had not yet received. Is this assumption correct?

  2. How does the replica server know when, and from where, to retrieve WAL? What in our settings is actually making this happen, or are the PostgreSQL defaults alone achieving it? Is it polling something over primary_conninfo?

  3. I assume that the simple way to end the Barman WAL archiving is to remove the archive_mode and archive_command settings from the primary server's configuration, and then to stop barman cron from running once per minute on the backup server. Is this correct? Based on my reading of the documentation, archive_mode is not involved in the replication side of things.

  4. What is the correct way of terminating the replication? My current assumption is to do one of the two things below, but is either one of these actually correct?

  • shut down Postgres on the replica server, remove the standby_mode setting (and recovery.conf with it), and upon starting Postgres again it will act like a standalone primary server.
  • create the trigger file and wait for recovery to finish, then possibly remove standby_mode and restart Postgres (roughly as sketched below)? I'm not quite sure how standby_mode behaves after recovery has ended.
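
For reference, the promotion mechanics I have in mind for that second option (the trigger_file path is from our recovery.conf; the pg_ctl form should be equivalent on 9.5, and the data directory path is a placeholder):

# On the replica, either touch the configured trigger file...
touch /tmp/trigger_file0

# ...or promote via pg_ctl:
pg_ctl promote -D /var/lib/postgresql/9.5/main

# Afterwards recovery.conf is renamed to recovery.done, and this should return 'f':
psql -c "SELECT pg_is_in_recovery();"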

Best Answer

  1. Yes.

  2. The standby server regularly performs restartpoints, which are the standby-mode equivalent of checkpoints. On startup, it checks its control file for the latest restartpoint and begins recovery from there. After it has finished replaying the WAL available locally, it connects to the primary over primary_conninfo and asks for streaming of WAL from that point on. (The monitoring queries after this list show where each side currently is.)

  3. Yes. (One way to stage the configuration change is sketched after this list.)

  4. Shut down both servers in any order you like. If you want to make sure that the replica has caught up, shut down the primary server first. (The monitoring queries below let you compare the primary's current WAL position with what the standby has replayed before you shut anything down.)
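
For points 1, 2 and 4, these are the standard 9.5 monitoring functions and settings; they show where each side is and which retention defaults you have been relying on (nothing here is specific to your setup):

-- On the standby: how far WAL has been received and replayed
SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();

-- On the primary: current WAL position and the state of each connected standby
SELECT pg_current_xlog_location();
SELECT application_name, state, sent_location, replay_location FROM pg_stat_replication;

-- On the primary: the settings governing how much WAL is kept around by default
SHOW wal_keep_segments;
SHOW min_wal_size;
SHOW max_wal_size;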
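
For point 3, one way to stage the change, as a sketch (adapt to however you manage postgresql.conf):

# Primary server's postgresql.conf
# archive_command only needs a reload to change; archive_mode needs a restart.
# Discarding WAL via /bin/true keeps the archiver happy until you can restart;
# with archive_mode = on and no archive_command, completed segments would pile up.
archive_command = '/bin/true'

# ...then at the next convenient restart:
archive_mode = off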