Postgresql – How to restore PostgreSQL standby server from archive

postgresqlpostgresql-9.5replicationrepmgr

I have a PostgreSQL 9.5 cluster with master and standby server that are replicated using repmgr. Master is configured to keep 4000 WAL files:

wal_level = logical
hot_standby = on
archive_command = 'test ! -f /mnt/share/psql/archive/psql/%f && cp %p /mnt/share/psql/archive/psql/%f'
max_worker_processes = 10                        
max_replication_slots = 10
max_wal_senders = 10
wal_keep_segments = 4000

At some point the standby server falls behind master (I have no idea why is this happening):

2017-01-28 23:49:24 UTC ERROR:  current transaction is aborted, commands ignored until end of transaction block
2017-01-28 23:49:24 UTC STATEMENT:  
2017-01-28 23:51:10 UTC LOG:  invalid magic number 0000 in log segment 0000000200001E5A000000C8, offset 5201920
2017-01-28 23:51:10 UTC LOG:  started streaming WAL from primary at 1E5A/C8000000 on timeline 2
2017-01-29 00:04:59 UTC FATAL:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

2017-01-29 00:16:10 UTC LOG:  invalid magic number 0000 in log segment 0000000200001E5B000000C5, offset 5242880
2017-01-29 00:16:10 UTC LOG:  started streaming WAL from primary at 1E5B/C5000000 on timeline 2
2017-01-29 00:50:07 UTC FATAL:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

2017-01-29 00:50:07 UTC LOG:  invalid magic number 0000 in log segment 0000000200001E5D0000000A, offset 5373952
2017-01-29 00:50:22 UTC LOG:  started streaming WAL from primary at 1E5D/A000000 on timeline 2
2017-01-29 03:29:35 UTC FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000200001E64000000B7 has already been removed

Obviously master removed needed WAL 0000000200001E64000000B7, but the file is still present in the archive.
Anyway, after standby restart the database seems to reach consistent state:

...
2017-01-29 13:45:35 UTC LOG:  restored log file "0000000200001E64000000B1" from archive
2017-01-29 13:45:36 UTC LOG:  restored log file "0000000200001E64000000B2" from archive
2017-01-29 13:45:36 UTC LOG:  restored log file "0000000200001E64000000B3" from archive
2017-01-29 13:45:36 UTC LOG:  restored log file "0000000200001E64000000B4" from archive
2017-01-29 13:45:37 UTC LOG:  restored log file "0000000200001E64000000B5" from archive
2017-01-29 13:45:37 UTC LOG:  restored log file "0000000200001E64000000B6" from archive
2017-01-29 13:45:37 UTC LOG:  restored log file "0000000200001E64000000B7" from archive
2017-01-29 13:45:38 UTC LOG:  consistent recovery state reached at 1E64/B7DFFD78
2017-01-29 13:45:38 UTC LOG:  recovery stopping after reaching consistency
2017-01-29 13:45:38 UTC LOG:  recovery has paused
2017-01-29 13:45:38 UTC HINT:  Execute pg_xlog_replay_resume() to continue.
2017-01-29 13:45:38 UTC LOG:  database system is ready to accept read only connections

PostgreSQL suggest to execute pg_xlog_replay_resume(), but this would lead to promoting standby to master, thus a split-brain situation.

$ repmgr cluster show
Role      | Name                    | Upstream                | Connection String
----------+-------------------------|-------------------------|-----------------------------------------
* master  | psql01a                 |                         | host=psql01a user=repmgr
  standby | psql01b                 | psql01a                 | host=psql01b user=repmgr

recovery.conf:

restore_command = 'cp /mnt/share/psql/archive/psql/%f %p'
recovery_target_inclusive = true
recovery_target = 'immediate'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'user=repmgr port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres host=psql01a application_name=psql01b password=ZDIzNjk2OTM2MWYyNjNiYzk5ZDVhMWIw'
recovery_min_apply_delay = 0

PostgreSQL 9.5 comes with a new parameter recovery_target_action that can have 3 values in recovery.conf:

  • pause (default) recovery will be paused
  • shutdown will stop the server after reaching the recovery target
  • promote means the recovery process will finish and the server will start to accept connections

promote seems to be logical choice according to the documentation, however it leads again to split-brain situation:

2017-01-29 19:31:27 UTC LOG:  consistent recovery state reached at 1E64/B7DFFD78
2017-01-29 19:31:27 UTC LOG:  recovery stopping after reaching consistency
2017-01-29 19:31:27 UTC LOG:  redo done at 1E64/B7DFFD78
2017-01-29 19:31:27 UTC LOG:  last completed transaction was at log time 2017-01-29 00:32:06.442239+00
2017-01-29 19:31:27 UTC LOG:  database system is ready to accept read only connections
cp: cannot stat '/mnt/share/psql/archive/psql/00000003.history': No such file or directory
2017-01-29 19:31:27 UTC LOG:  selected new timeline ID: 3
cp: cannot stat '/mnt/share/psql/archive/psql/00000002.history': No such file or directory
2017-01-29 19:31:28 UTC LOG:  archive recovery complete

$ repmgr cluster show
Role      | Name                    | Upstream                | Connection String
----------+-------------------------|-------------------------|-----------------------------------------
* master  | psql01a                 |                         | host=psql01a user=repmgr
* master  | psql01b                 | psql01a                 | host=psql01b user=repmgr

The question is how to resume WAL streaming without promoting standby to master? Is it possible to do this in an automatic manner?

Best Answer

Don't specify a recovery_target. Then recovery_target_action doesn't matter.