Postgresql – How to restore Postgresql without pg_control but with file system backup and wal files

postgresqlpostgresql-10restore

OS: Ubuntu 16.04, Docker

Postgresql: 10

I'm using an alternative file-system backup that is making “consistent snapshot” of the data directory every 24 hours. I'm experiencing the warning from the Postgresql documentation. See link above.

and you are willing to trust that it is implemented correctly …

The problem is that the files are incomplete after 1 week. For whatever reason my pg_control file has only been backed up for 6 days. This leads to the problem that restoring a file system backup that is older than 6 days throws an error that the pg_control file is missing.

A positive aspect of my backup is that I have a complete history of WAL files. So not only 6 days but endless into the past.

So my plan is to work with the copy that is 6 days old and try to replay the WAL logs until I reached my desired state. Let's work with a real dates.

My database starts just fine for the date 2019-01-27 and I now want to replay WAL files to get the database state to 2018-12-18.


Prerequisite
I copied the server files from the 2019-01-27 to a local machine where the same version of Postgresql (10) is running. The backup has been taken while Postgresql was running. Anyways as explained in the documentation it just gives a short warning and starts fine locally.

Intent 1: Copy WAL files into pg_wal and start

  1. Clear the main directory: rm -rf /var/lib/postgresql/10/main/*
  2. Extract the copy that works fine. Change ownership and permissions. Remove postmaster.pid.
  3. Copy all pg_wal files to pg_wal
  4. As postgres user: /usr/lib/postgresql/10/bin postgres -D /.../main
  5. Everything works fine.

Logs:

LOG: database system was interrupted; last known up at 2019-01-27 20:00:00 UTC
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 1/D160878
LOG: invalid record length at 1/D160958; wanted 24, got 0
LOG: redo done at 1/D160920
LOG: database system is ready to accept connections

Weird for me is that the WAL files total number has been decreased inside of pg_wal.

  1. Stop the server.

  2. Create a recovery.conf

restore_command = 'cp /dir/with/all_wal_files_backup/%f "%p"'
recovery_target_time = '2018-12-18 20:00:00 UTC'
recovery_target_inclusive = false
  1. Start the server

LOGS:

LOG: starting point-in-teim recovery to 2018-12-18 20:00:00+00
cp: cannot stat /dir/with/all_wal_files_backup/00010001D: No such file or dir
consisten recovery state reached at 1/D160a70
redo is not required
db is ready to accept read only connections
cp: cannot stat /dir/with/all_wal_files_backup/00010001D: No such file or dir
cp: cannot stat /dir/with/all_wal_files_backup/0002.history: No such file or dir
LOG: selected new timeline ID: 2
Archive recovery complete
cp: cannot stat /dir/with/all_wal_files_backup/0001.history: No such file or dir
db is ready to accept read only connections

Now when I try to select the data that I'm searching for they are not there. Where is my mistake?

I already tried to pg_basebackup and then to restore and countless of other forms but always without success so I hope that someone here can help me out.

thanks in advance


Update

I tried the procedure as explained in the comment with the following result: I still can't select the data I'm expecting. Even though it tells me succes it seems like the recovery is broken. Any ideas?

2019-02-04 14:08:16.330 UTC [12412] LOG:  database system was interrupted; last known up at 2019-01-27 20:04:15 UTC
2019-02-04 14:08:18.181 UTC [12412] LOG:  starting point-in-time recovery to 2018-12-15 08:00:00+00
2019-02-04 14:08:18.181 UTC [12412] LOG:  database system was not properly shut down; automatic recovery in progress
2019-02-04 14:08:18.185 UTC [12412] LOG:  redo starts at 1/D160878
2019-02-04 14:08:18.185 UTC [12412] LOG:  invalid record length at 1/D160958: wanted 24, got 0
2019-02-04 14:08:18.188 UTC [12412] LOG:  consistent recovery state reached at 1/D160958
2019-02-04 14:08:18.188 UTC [12411] LOG:  database system is ready to accept read only connections
2019-02-04 14:08:18.200 UTC [12412] LOG:  restored log file "00000001000000010000000D" from archive
2019-02-04 14:08:18.242 UTC [12412] LOG:  invalid record length at 1/D160958: wanted 24, got 0
2019-02-04 14:08:18.242 UTC [12412] LOG:  redo done at 1/D160920
2019-02-04 14:08:18.266 UTC [12412] LOG:  restored log file "00000001000000010000000D" from archive
cp: cannot stat '/var/lib/postgresql/10/main/archivedir/00000002.history': No such file or directory
2019-02-04 14:08:18.313 UTC [12412] LOG:  selected new timeline ID: 2
2019-02-04 14:08:18.393 UTC [12412] LOG:  archive recovery complete
cp: cannot stat '/var/lib/postgresql/10/main/archivedir/00000001.history': No such file or directory
2019-02-04 14:08:18.532 UTC [12411] LOG:  database system is ready to accept connections

Best Answer

My database starts just fine for the date 2019-01-27 and I now want to replay WAL files to get the database state to 2018-12-18.

Replay only works forward, never backwards.

If pg_control were they only thing missing from your older backups, it is possible you could manually recreate a suitable copy of pg_control using a newer copy as a template. But chances are other files are also missing.

Your situation seems hopeless. If this data is truly precious, you should hire data-recovery professionals. You are beyond "free help on the internet" territory.