PostgreSQL Replication – Fixing ‘No Such File or Directory’ for WAL File

postgresqlreplication

I'm studying postgresql recently. Now I'm learning how to configure PG replication with slot. — So I'm following this sample — https://girders.org/postgresql/2021/11/05/setup-postgresql14-replication/

But all attempts are failed with "No such file or directory" when there's new tran raised from primary side. I can't find the issue and the fix. Please kind have a look and help.

Environment —

VM1 -- 100.70.224.70/23 -- primary side
VM2 -- 100.70.225.241/23 -- replica
PG version -- 14.10

Steps:

  1. On the both sides , there're a PG instance already, the PGDATA are both /var/lib/pgsql/data and the archive folder are both /tmp/pgbak
    Both are configured archive with same configuration —
archive_command = 'test ! -f /tmp/pgbak/%f && cp %p /tmp/pgbak/%f'
archive_timeout = '1min'
archive_mode = 'on'
archive_cleanup_command = 'pg_archivecleanup archivelocation %r'
restore_command = 'cp /tmp/pgbak/%f %p'
  1. On replica, clean thePG data folder — rm -rf /var/lib/pgsql/data/*
    and also clean the archive folder — rm -rf /tmp/pgbak
  2. On primary , edit postgresql.conf —
wal_level = replica
max_wal_senders = 10
wal_keep_size = '1GB'
wal_compression = on
  1. On primary, create the replication user —
   createuser -U postgres --replication repl
  1. On primary, edit the pg_hba.conf added 2 lines–
host    all             all             0.0.0.0/0            trust
host    replication     all             0.0.0.0/0               trust

And restart the PG instance on primary.
Then tested on replica with psql , it works fine.

  1. Create replication slot on primary —
select * from pg_create_physical_replication_slot('db02_repl_slot');
  1. On replica, make a basebackup —
pg_basebackup --pgdata /var/lib/pgsql/data --format=p --write-recovery-conf --checkpoint=fast --label=mffb --progress --host=100.70.224.70 -R --username=repl
  1. On replica, add the replication slot to postgresql.auto.conf —
primary_conninfo = 'user=repl passfile=''/var/lib/pgsql/.pgpass'' channel_binding=prefer host=100.70.224.70 port=5432 sslmode=prefer sslcompression=0 sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable application_name=db02.repl'
primary_slot_name = 'db02_repl_slot'
  1. Start the PG instance on replica, and check the slot status on primary, it looks fine —
eisendb=# select slot_name, slot_type, active, wal_status from pg_replication_slots;
   slot_name    | slot_type | active | wal_status
----------------+-----------+--------+------------
 db02_repl_slot | physical  | t      | reserved
  1. Then I tested with data modification on primary, then I found no data transferred to replica, and from the error log file on replica , I found this error —
2024-01-01 12:30:07.066 UTC   [4737]CONTEXT:  WAL redo at D5/75000060 for Standby/RUNNING_XACTS: nextXid 10361 latestCompletedXid 10360 oldestRunningXid 10361
2024-01-01 12:30:07.066 UTC   [4737]DEBUG:  executing restore command "cp /tmp/pgbak/00000001000000D500000076 pg_wal/RECOVERYXLOG"
2024-01-01 12:30:07.068 UTC   [4741]DEBUG:  checkpointer updated shared memory configuration values
cp: cannot stat '/tmp/pgbak/00000001000000D500000076': No such file or directory
2024-01-01 12:30:07.069 UTC   [4737]DEBUG:  could not restore file "00000001000000D500000076" from archive: child process exited with exit code 1
2024-01-01 12:30:07.069 UTC   [4737]DEBUG:  prune KnownAssignedXids to 10361


And this is the log records filtered the debug messages --
2024-01-01 13:09:12.622 UTC   [8367]LOG:  database system was interrupted; last known up at 2024-01-01 13:05:59 UTC
cp: cannot stat '/tmp/pgbak/00000002.history': No such file or directory
2024-01-01 13:09:12.634 UTC   [8367]LOG:  entering standby mode
cp: cannot stat '/tmp/pgbak/00000001000000D500000082': No such file or directory
2024-01-01 13:09:12.639 UTC   [8367]LOG:  redo starts at D5/82000028
2024-01-01 13:09:12.640 UTC   [8367]LOG:  consistent recovery state reached at D5/82000138
2024-01-01 13:09:12.640 UTC   [8362]LOG:  database system is ready to accept read-only connections
cp: cannot stat '/tmp/pgbak/00000001000000D500000083': No such file or directory
2024-01-01 13:09:12.649 UTC   [8374]LOG:  started streaming WAL from primary at D5/83000000 on timeline 1
2024-01-01 13:10:28.316 UTC   [8367]LOG:  recovery stopping before commit of transaction 10363, time 2024-01-01 13:10:28.315075+00
2024-01-01 13:10:28.316 UTC   [8367]LOG:  pausing at the end of recovery
2024-01-01 13:10:28.316 UTC   [8367]HINT:  Execute pg_wal_replay_resume() to promote.


It looks the replica is searching for some archived wal in the archive folder but no found. And meanwhile I checked the /tmp/pgbak on replica and find it's empty as well…I'm not familiar with PG replication details, so I'm thinking if there's any error in my configuration so that the archived wal on primary can't be replicated to replica?
If yes, please kind correct me. Thanks in advance.

Best Answer

That looks very much like you have one of the recovery_target_* parameters set on the standby, so that recovery stops at that point. The setting was probably copied from the primary when you ran pg_basebackup.