PostgreSQL Replication – Fixing ‘No Such File or Directory’ for WAL File

postgresqlreplication

I'm studying postgresql recently. Now I'm learning how to configure PG replication with slot. — So I'm following this sample — https://girders.org/postgresql/2021/11/05/setup-postgresql14-replication/

But all attempts are failed with "No such file or directory" when there's new tran raised from primary side. I can't find the issue and the fix. Please kind have a look and help.

Environment —

VM1 -- 100.70.224.70/23 -- primary side
VM2 -- 100.70.225.241/23 -- replica
PG version -- 14.10

Steps:

On the both sides , there're a PG instance already, the PGDATA are both /var/lib/pgsql/data and the archive folder are both /tmp/pgbak
Both are configured archive with same configuration —

archive_command = 'test ! -f /tmp/pgbak/%f && cp %p /tmp/pgbak/%f'
archive_timeout = '1min'
archive_mode = 'on'
archive_cleanup_command = 'pg_archivecleanup archivelocation %r'
restore_command = 'cp /tmp/pgbak/%f %p'

On replica, clean thePG data folder — rm -rf /var/lib/pgsql/data/*
and also clean the archive folder — rm -rf /tmp/pgbak
On primary , edit postgresql.conf —

wal_level = replica
max_wal_senders = 10
wal_keep_size = '1GB'
wal_compression = on

On primary, create the replication user —

   createuser -U postgres --replication repl

On primary, edit the pg_hba.conf added 2 lines–

host    all             all             0.0.0.0/0            trust
host    replication     all             0.0.0.0/0               trust

And restart the PG instance on primary.
Then tested on replica with psql , it works fine.

Create replication slot on primary —

select * from pg_create_physical_replication_slot('db02_repl_slot');

On replica, make a basebackup —

pg_basebackup --pgdata /var/lib/pgsql/data --format=p --write-recovery-conf --checkpoint=fast --label=mffb --progress --host=100.70.224.70 -R --username=repl

On replica, add the replication slot to postgresql.auto.conf —

primary_conninfo = 'user=repl passfile=''/var/lib/pgsql/.pgpass'' channel_binding=prefer host=100.70.224.70 port=5432 sslmode=prefer sslcompression=0 sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable application_name=db02.repl'
primary_slot_name = 'db02_repl_slot'

Start the PG instance on replica, and check the slot status on primary, it looks fine —

eisendb=# select slot_name, slot_type, active, wal_status from pg_replication_slots;
   slot_name    | slot_type | active | wal_status
----------------+-----------+--------+------------
 db02_repl_slot | physical  | t      | reserved

Then I tested with data modification on primary, then I found no data transferred to replica, and from the error log file on replica , I found this error —

2024-01-01 12:30:07.066 UTC   [4737]CONTEXT:  WAL redo at D5/75000060 for Standby/RUNNING_XACTS: nextXid 10361 latestCompletedXid 10360 oldestRunningXid 10361
2024-01-01 12:30:07.066 UTC   [4737]DEBUG:  executing restore command "cp /tmp/pgbak/00000001000000D500000076 pg_wal/RECOVERYXLOG"
2024-01-01 12:30:07.068 UTC   [4741]DEBUG:  checkpointer updated shared memory configuration values
cp: cannot stat '/tmp/pgbak/00000001000000D500000076': No such file or directory
2024-01-01 12:30:07.069 UTC   [4737]DEBUG:  could not restore file "00000001000000D500000076" from archive: child process exited with exit code 1
2024-01-01 12:30:07.069 UTC   [4737]DEBUG:  prune KnownAssignedXids to 10361


And this is the log records filtered the debug messages --
2024-01-01 13:09:12.622 UTC   [8367]LOG:  database system was interrupted; last known up at 2024-01-01 13:05:59 UTC
cp: cannot stat '/tmp/pgbak/00000002.history': No such file or directory
2024-01-01 13:09:12.634 UTC   [8367]LOG:  entering standby mode
cp: cannot stat '/tmp/pgbak/00000001000000D500000082': No such file or directory
2024-01-01 13:09:12.639 UTC   [8367]LOG:  redo starts at D5/82000028
2024-01-01 13:09:12.640 UTC   [8367]LOG:  consistent recovery state reached at D5/82000138
2024-01-01 13:09:12.640 UTC   [8362]LOG:  database system is ready to accept read-only connections
cp: cannot stat '/tmp/pgbak/00000001000000D500000083': No such file or directory
2024-01-01 13:09:12.649 UTC   [8374]LOG:  started streaming WAL from primary at D5/83000000 on timeline 1
2024-01-01 13:10:28.316 UTC   [8367]LOG:  recovery stopping before commit of transaction 10363, time 2024-01-01 13:10:28.315075+00
2024-01-01 13:10:28.316 UTC   [8367]LOG:  pausing at the end of recovery
2024-01-01 13:10:28.316 UTC   [8367]HINT:  Execute pg_wal_replay_resume() to promote.

It looks the replica is searching for some archived wal in the archive folder but no found. And meanwhile I checked the /tmp/pgbak on replica and find it's empty as well…I'm not familiar with PG replication details, so I'm thinking if there's any error in my configuration so that the archived wal on primary can't be replicated to replica?
If yes, please kind correct me. Thanks in advance.

Best Answer

That looks very much like you have one of the recovery_target_* parameters set on the standby, so that recovery stops at that point. The setting was probably copied from the primary when you ran pg_basebackup.

Related Solutions

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages:

http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN

After the rsync, did you really run what you show?:

pgsql -c "select pg_stop_backup();";

Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:

Log: consistent recovery state reached at 0/BF0000B0

and:

Log: streaming replication successfully connected to primary

Did you try connecting to the slave at this point? What happened?

The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:

The only ways I have restarted Postgres is through the service postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands. After I receive this error, I kill all processes and again try to restart the database...

Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:

log_line_prefix = '[%m] %p %q<%u %d %r> '

The recovery.conf script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?

Postgresql streaming replication and wal archive

Further testing revealed that the different WAL files were mostly those that had been rotated but not yet written to. The new slave would try to archive these (possibly fixed in 9.2) but they wouldn't be used for recovery.

The other problem is the overwriting. We disabled the overwrite (instead storing with a new filename so we at least keep the WAL just in case) and everything works... both failover and later restore.

NOTE: my earlier answer here that was completely wrong. See the edit log if you need to see how wrong.

Best Answer

Related Solutions

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

Postgresql streaming replication and wal archive

Related Question