PostgreSQL 11 – Terminating walreceiver Process Due to Administrator Command

postgresql, postgresql-11

One of my PostgreSQL 11 servers was throwing a strange error and refusing to correctly initiate streaming replication after a base backup:

FATAL: terminating walreceiver process due to administrator command
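
For context, a PostgreSQL 11 standby is driven by a recovery.conf file in the data directory. The one on this server looked roughly like the sketch below; the connection details are placeholders (the real primary_conninfo is snipped in the logs further down):

# /var/lib/postgresql/11/replica/recovery.conf (sketch, placeholder values)
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator password=...'
recovery_target_timeline = 'latest'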

I tried enabling the highest log verbosity (debug5; the settings used are sketched after the log excerpt below), but that didn't provide more insight into why it kept dying:

2021-07-29 15:33:30.862 UTC [39315] DEBUG:  postgres: PostmasterMain: initial environment dump:
2021-07-29 15:33:30.862 UTC [39315] DEBUG:  -----------------------------------------
2021-07-29 15:33:30.862 UTC [39315] DEBUG:      PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
2021-07-29 15:33:30.862 UTC [39315] DEBUG:      PG_GRANDPARENT_PID=39310
2021-07-29 15:33:30.862 UTC [39315] DEBUG:      PGLOCALEDIR=/usr/share/locale
2021-07-29 15:33:30.862 UTC [39315] DEBUG:      PGSYSCONFDIR=/etc/postgresql-common
2021-07-29 15:33:30.862 UTC [39315] DEBUG:      LANG=en_US.UTF-8
2021-07-29 15:33:30.862 UTC [39315] DEBUG:      PWD=/
2021-07-29 15:33:30.863 UTC [39315] DEBUG:      PGDATA=/var/lib/postgresql/11/replica
2021-07-29 15:33:30.863 UTC [39315] DEBUG:      LC_COLLATE=en_US.UTF-8
2021-07-29 15:33:30.863 UTC [39315] DEBUG:      LC_CTYPE=en_US.UTF-8
2021-07-29 15:33:30.863 UTC [39315] DEBUG:      LC_MESSAGES=en_US.UTF-8
2021-07-29 15:33:30.863 UTC [39315] DEBUG:      LC_MONETARY=C
2021-07-29 15:33:30.863 UTC [39315] DEBUG:      LC_NUMERIC=C
2021-07-29 15:33:30.863 UTC [39315] DEBUG:      LC_TIME=C
2021-07-29 15:33:30.863 UTC [39315] DEBUG:  -----------------------------------------
2021-07-29 15:33:30.867 UTC [39315] DEBUG:  registering background worker "logical replication launcher"
2021-07-29 15:33:30.868 UTC [39315] LOG:  listening on IPv6 address "::1", port 5433
2021-07-29 15:33:30.868 UTC [39315] LOG:  listening on IPv4 address "127.0.0.1", port 5433
2021-07-29 15:33:30.868 UTC [39315] LOG:  listening on IPv4 address "*snip*", port 5433
2021-07-29 15:33:30.868 UTC [39315] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2021-07-29 15:33:30.869 UTC [39315] DEBUG:  invoking IpcMemoryCreate(size=156459008)
2021-07-29 15:33:30.869 UTC [39315] DEBUG:  mmap(157286400) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2021-07-29 15:33:30.885 UTC [39315] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2021-07-29 15:33:30.885 UTC [39315] DEBUG:  removing file "pg_notify/0000"
2021-07-29 15:33:30.885 UTC [39315] DEBUG:  dynamic shared memory system will support 588 segments
2021-07-29 15:33:30.885 UTC [39315] DEBUG:  created dynamic shared memory control segment 1597797971 (14128 bytes)
2021-07-29 15:33:30.887 UTC [39315] DEBUG:  max_safe_fds = 982, usable_fds = 1000, already_open = 8
2021-07-29 15:33:30.889 UTC [39316] LOG:  database system was shut down in recovery at 2021-07-29 15:31:23 UTC
2021-07-29 15:33:30.890 UTC [39316] DEBUG:  standby_mode = 'on'
2021-07-29 15:33:30.890 UTC [39316] DEBUG:  primary_conninfo = '*snip*'
2021-07-29 15:33:30.890 UTC [39316] DEBUG:  recovery_target_timeline = latest
2021-07-29 15:33:30.890 UTC [39316] LOG:  entering standby mode
2021-07-29 15:33:30.890 UTC [39316] LOG:  invalid resource manager ID 172 at 235/450000D0
2021-07-29 15:33:30.890 UTC [39316] DEBUG:  switched WAL source from archive to stream after failure
2021-07-29 15:33:30.891 UTC [39317] DEBUG:  find_in_dynamic_libpath: trying "/usr/lib/postgresql/11/lib/libpqwalreceiver"
2021-07-29 15:33:30.905 UTC [39317] DEBUG:  find_in_dynamic_libpath: trying "/usr/lib/postgresql/11/lib/libpqwalreceiver.so"
2021-07-29 15:33:30.918 UTC [39317] LOG:  started streaming WAL from primary at 235/45000000 on timeline 1
2021-07-29 15:33:30.920 UTC [39317] DEBUG:  sendtime 2021-07-29 15:33:30.917704+00 receipttime 2021-07-29 15:33:30.920718+00 replication apply delay (N/A) transfer latency 3 ms
2021-07-29 15:33:30.920 UTC [39317] DEBUG:  sending write 235/45020000 flush 0/0 apply 0/0
2021-07-29 15:33:30.921 UTC [39317] DEBUG:  sending write 235/45020000 flush 235/45020000 apply 0/0
2021-07-29 15:33:30.921 UTC [39316] LOG:  invalid resource manager ID 172 at 235/450000D0
2021-07-29 15:33:30.921 UTC [39317] FATAL:  terminating walreceiver process due to administrator command
2021-07-29 15:33:30.921 UTC [39317] DEBUG:  shmem_exit(1): 1 before_shmem_exit callbacks to make
2021-07-29 15:33:30.921 UTC [39317] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
2021-07-29 15:33:30.921 UTC [39317] DEBUG:  proc_exit(1): 2 callbacks to make
2021-07-29 15:33:30.921 UTC [39317] DEBUG:  exit(1)
2021-07-29 15:33:30.921 UTC [39316] DEBUG:  switched WAL source from stream to archive after failure
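
For reference, the verbosity above was raised with settings along these lines in postgresql.conf (a sketch; the exact values on the server may have differed):

# postgresql.conf (sketch)
log_min_messages = debug5
log_error_verbosity = verbose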

The only concerning entries were the LOG-level messages like

LOG: invalid resource manager ID 172 at 235/450000D0

It turned out, however, that this is just Postgres' way of saying "reached the end of valid WAL records", and such LOG-level messages can be safely ignored.
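
One way to sanity-check what actually sits at that position is to dump the WAL around the reported LSN with pg_waldump, which ships with PostgreSQL 11. The binary path below assumes the Debian/Ubuntu packaging, and the segment name assumes default 16 MB segments on timeline 1:

# Dump the WAL records around the failing LSN 235/450000D0 on the standby
/usr/lib/postgresql/11/bin/pg_waldump \
    --path=/var/lib/postgresql/11/replica/pg_wal \
    --timeline=1 \
    --start=235/45000000 \
    --end=235/45020000
# The segment containing that LSN would be 000000010000023500000045

If pg_waldump also stops with an error at that LSN, the copy on the standby's disk is genuinely unreadable at that point.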

I tried removing the existing WAL segments (from datadir/pg_wal/), thinking that file corruption might be at play, but the server still would not initiate replication. The only solution was to take a completely new base backup.
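
For completeness, rebuilding the standby from scratch amounts to something along these lines; host, user, and directory are placeholders, and --write-recovery-conf makes pg_basebackup generate a matching recovery.conf on PostgreSQL 11:

# Stop the standby and move the old data directory aside first
pg_basebackup \
    --host=primary.example.com \
    --username=replicator \
    --pgdata=/var/lib/postgresql/11/replica \
    --wal-method=stream \
    --write-recovery-conf \
    --checkpoint=fast \
    --progress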

My question: is terminating xyz process due to administrator command a generic message Postgres emits when one of its processes dies in a non-standard way?

Would there be further debugging options in cases like this? Not even the highest logging verbosity provided much useful information.

Best Answer

What is the full version of PostgreSQL on both master and replica?
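
One quick way to capture that on both nodes, assuming local psql access:

psql -XtAc 'SELECT version();'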

That looks like it probably was corruption, and taking a new base backup is all you could do (along with investigating your hardware to see if you can figure out why the corruption happened).

I assume you removed WAL from the replica, not from the master, so the replica would simply re-fetch the same WAL files and find them still corrupt. It didn't have to turn out that way, but presumably it did, or you wouldn't be asking the question. If the corruption had come from network glitches affecting the files in transit, the newly fetched copy might have been good; apparently, though, the corruption exists in the original WAL files on the master, so the data stays corrupt no matter how many times you read it. Also, if the corruption happened only on disk when the original WAL file was written, the archive might not be corrupted: if the archive command copied the WAL file while the data was still in the filesystem cache in RAM, that copy wouldn't be corrupted. So either the corruption happened in RAM itself, or the archive command had to read the (now corrupted) data back from disk in order to copy it to the archive. Or it happened even earlier, for example due to a software bug.
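
If you still have the files, one way to narrow down where the corruption lives is to compare checksums of the suspect segment on the master, in the WAL archive, and on the replica. The segment name below corresponds to LSN 235/45000000 with default 16 MB segments; all paths are placeholders for your actual layout:

# On the master
sha256sum /var/lib/postgresql/11/main/pg_wal/000000010000023500000045
# In the WAL archive (wherever archive_command puts segments)
sha256sum /path/to/wal_archive/000000010000023500000045
# On the replica
sha256sum /var/lib/postgresql/11/replica/pg_wal/000000010000023500000045

If the master's copy and the archive copy differ, the damage happened around that copy step; if they match, the corruption predates archiving.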

There is no extra information to be had. The system can't figure out what it is looking at, so all it can do is puke the mess up for us to speculate about. But yes, it did command the WAL receiver to exit; there is no point in keeping it running since it can't replay past the point of the corruption. So it switches back to the WAL archive, to see if by any chance the copy there is not corrupted.