Fixing Failed PostgreSQL Replication from Master to Standby

Tags: barman, postgresql, postgresql-9.6, replication

I have a master-standby Postgres cluster. For availability reasons, I want to add a new standby server.

So I created a new server and did the base installation (installing Postgres, creating the Postgres data filesystem), then ran pg_basebackup to the new standby. I tried it many times, both from the master and from the first standby, and every attempt failed.

pg_basebackup -D - -h localhost -U replicator -Ft --compress=0 --progress | pigz -p $THREADS | ssh -A postgres@$TARGETDB "pigz -dc - | tar xvf - --directory=/var/lib/pgsql/9.6/data/"

When it finishes and I start Postgres, startup fails with missing WALs and a diverged timeline, though I am pretty sure the requested WALs and history file do not exist on either the primary or the secondary.

2022-01-24 11:32:00 GMT [17951]: [1-1] user=,db=,app=,client= LOG:  database system was interrupted while in recovery at log time 2022-01-24 11:10:02 GMT
2022-01-24 11:32:00 GMT [17951]: [2-1] user=,db=,app=,client= HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2022-01-24 11:32:01 GMT [17951]: [3-1] user=,db=,app=,client= LOG:  restored log file "00000007.history" from archive
ERROR: WAL file '00000008.history' not found in server '****' (SSH host: 10.154.129.90)
2022-01-24 11:32:01 GMT [17951]: [4-1] user=,db=,app=,client= LOG:  entering standby mode
2022-01-24 11:32:02 GMT [17951]: [5-1] user=,db=,app=,client= LOG:  restored log file "00000007.history" from archive
ERROR: WAL file '0000000700001C55000000F4' not found in server '****' (SSH host: 10.154.129.90)
2022-01-24 11:32:03 GMT [17951]: [6-1] user=,db=,app=,client= LOG:  restored log file "0000000600001C55000000F4" from archive
ERROR: WAL file '0000000700001C55000000F3' not found in server '****' (SSH host: 10.154.129.90)
2022-01-24 11:32:04 GMT [18134]: [1-1] user=postgres,db=postgres,app=[unknown],client=[local] FATAL:  the database system is starting up
2022-01-24 11:32:04 GMT [18145]: [1-1] user=postgres,db=postgres,app=[unknown],client=[local] FATAL:  the database system is starting up
2022-01-24 11:32:04 GMT [17951]: [7-1] user=,db=,app=,client= LOG:  restored log file "0000000600001C55000000F3" from archive
2022-01-24 11:32:04 GMT [17951]: [8-1] user=,db=,app=,client= FATAL:  requested timeline 7 is not a child of this server's history
2022-01-24 11:32:04 GMT [17951]: [9-1] user=,db=,app=,client= DETAIL:  Latest checkpoint is at 1C56/47CACF28 on timeline 6, but in the history of the requested timeline, the server forked off from that timeline at 1C3F/B7B96B90.
2022-01-24 11:32:04 GMT [17948]: [3-1] user=,db=,app=,client= LOG:  startup process (PID 17951) exited with exit code 1
2022-01-24 11:32:04 GMT [17948]: [4-1] user=,db=,app=,client= LOG:  aborting startup due to startup process failure
2022-01-24 11:32:04 GMT [17948]: [5-1] user=,db=,app=,client= LOG:  database system is shut down
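
A note for reading the log above: the first eight hex digits of a WAL segment or history file name are the timeline ID, so 0000000700001C55000000F4 belongs to timeline 7 while the restored 0000000600001C55000000F4 belongs to timeline 6. A minimal shell sketch that decodes this (the function name wal_timeline is hypothetical, not part of PostgreSQL or barman):

```shell
#!/bin/sh
# Print the timeline ID (in decimal) encoded in a WAL segment or
# timeline history file name. The first 8 hex digits are the timeline.
wal_timeline() {
    name=$1
    tli=$(printf '%s' "$name" | cut -c1-8)

    # Reject anything that does not start with exactly 8 hex digits.
    case $tli in
        ""|*[!0-9A-Fa-f]*)
            echo "not a WAL or history file name: $name" >&2
            return 1 ;;
    esac
    [ "${#tli}" -eq 8 ] || {
        echo "not a WAL or history file name: $name" >&2
        return 1
    }

    printf '%d\n' "0x$tli"
}
```

For example, `wal_timeline 0000000700001C55000000F4` prints 7 and `wal_timeline 0000000600001C55000000F3` prints 6, which matches the FATAL line: the checkpoint is on timeline 6, but recovery is trying to follow timeline 7.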

Also, we archive WALs to a barman server, so I don't suspect WALs going missing during the backup either.

recovery.conf file:

standby_mode = 'on'
primary_conninfo = 'user=replicator password=C0D5wallop host=$PRIMARYSERVER port=5432 sslmode=prefer sslcompression=1'
trigger_file = '/var/lib/pgsql/9.6/boo'
recovery_target_timeline='latest'
restore_command = 'ssh -o StrictHostKeyChecking=no barman@$BARMANSERVER barman get-wal db-44 %f > %p'

I'm happy to provide more info. I appreciate your help.

Best Answer

It turned out that I had done a test restoration in the past on a new db instance without removing the archive_command from its postgresql.conf file, so that instance ended up archiving a 00000007.history file for an otherwise empty timeline.

So when a new server tried to fetch archived logs from barman, it found that dummy 00000007.history file but no actual xlogs for that timeline on the barman server, which led to the error logs above.

Solution:

Connect to the barman server.
Manually move the 00000007.history file out of the way.
Manually remove the 00000007.history line from xlog.db in the wals/ directory on the barman server.
Restart Postgres on the secondary.

Advice: back up anything you change on the barman server before changing it.
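
The manual steps above could be scripted roughly as follows. This is only a sketch under assumptions that are not in the original answer: the function name quarantine_history is made up, the wals/ path and the stash directory are supplied by the caller (I believe `barman show-server db-44` reports the wals_directory), and each xlog.db line is assumed to start with the name of the file it describes. Stop or pause archiving activity on the barman side before trying anything like this.

```shell
#!/bin/sh
# Sketch only: move a stray history file out of a barman wals/ directory
# and drop its entry from xlog.db, keeping copies of both.
# Assumption: every xlog.db line begins with the file name it describes.
quarantine_history() {
    wals_dir=$1   # the server's wals/ directory on the barman host
    history=$2    # e.g. 00000007.history
    stash=$3      # directory outside wals/ where the copies are kept

    [ -f "$wals_dir/$history" ] || {
        echo "no $history in $wals_dir" >&2
        return 1
    }
    mkdir -p "$stash" || return 1

    # Back up xlog.db first, then rewrite it without the stray entry.
    if [ -f "$wals_dir/xlog.db" ]; then
        cp -p "$wals_dir/xlog.db" "$stash/xlog.db.bak" || return 1
        awk -v f="$history" 'index($0, f) != 1' \
            "$stash/xlog.db.bak" > "$wals_dir/xlog.db" || return 1
    fi

    # Move rather than delete, so the change can be undone.
    mv "$wals_dir/$history" "$stash/$history"
}
```

It would be called as `quarantine_history <wals directory> 00000007.history <stash directory>`. Barman also has a `rebuild-xlogdb` command that regenerates xlog.db from the files on disk, which may be a safer alternative to editing the file by hand; check the documentation for your barman version.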

Related Solutions

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages:

http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN

After the rsync, did you really run what you show?:

pgsql -c "select pg_stop_backup();";

Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:

Log: consistent recovery state reached at 0/BF0000B0

and:

Log: streaming replication successfully connected to primary

Did you try connecting to the slave at this point? What happened?

The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:

The only ways I have restarted Postgres is through the service postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands. After I receive this error, I kill all processes and again try to restart the database...

Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:

log_line_prefix = '[%m] %p %q<%u %d %r> '

The recovery.conf script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?

Postgresql – Streaming Replication in PostgreSQL

PostgreSQL replicas never finish recovering. This is by design. Basically, a replica is always in "recovering from disaster" mode, except that it receives the WAL segments from the master rather than reading them from disk.

So what you are seeing is not cause for concern. If it is not working yet, then you will need to provide a more detailed description of what you are trying to do and what is not working. But from what you have posted, it seems normal.
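
One way to confirm that a server is running as a replica is to ask it whether it is in recovery. A minimal sketch, assuming psql can connect with whatever options you put in the PSQL variable (the variable and the function name is_standby are my own, not standard tooling):

```shell
#!/bin/sh
# PSQL may be overridden, e.g. PSQL="psql -h standby1 -U postgres".
PSQL=${PSQL:-psql}

# Exit status 0 when the server reports it is in recovery (a standby),
# non-zero otherwise. pg_is_in_recovery() returns "t" or "f".
is_standby() {
    [ "$($PSQL -X -At -c 'SELECT pg_is_in_recovery()')" = "t" ]
}
```

On a healthy streaming replica, `is_standby && echo "in recovery, as expected"` should print the message for as long as the server has not been promoted.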
