We are running Postgres 9.6, with 10+TB size. Backups have been taken using a homegrown tool "pgrsync", which uses S3 as the repository. Both the backup files and WAL archives are stored on S3.
Problem: While trying to restore, the WAL archives restore randomly fails for some backups.
In this case, the backup start location is 00000002000544C60000006B
and stop location is 00000002000545210000008D
, (based on pg_stop_backup() output) but it stops in between at 005493D and terminates the restore. If I redo the restore again, it stops exactly at the same point. Similar results from couple of more backups, while other backups successfully restores.
It is possibly an indication of some specific WAL files got corrupted during the backup/restore process. Is that the correct interpretation?
Questions:
- Are there any ways of identifying if the WAL file is corrupted?
- Is there a way to move ahead without losing data? (I am wary of using
pg_resetxlog
)
First the backup file (from the WAL location 00000002000544C60000006B.0001E9A0.backup
)
START WAL LOCATION: 544C6/6B01E9A0 (file 00000002000544C60000006B)
STOP WAL LOCATION: 54521/8D235490 (file 00000002000545210000008D)
CHECKPOINT LOCATION: 544C6/BA84E0A8
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2020-11-04 03:24:14 UTC
LABEL: inc04nov
STOP TIME: 2020-11-04 09:21:38 UTC
Then the log files:
2020-11-11 06:38:08 UTC [21731]: [22299-1] user=,db=LOG: restored log file "000000020005451D00000080" from archive
2020-11-11 06:38:08 UTC [21731]: [22300-1] user=,db=LOG: restored log file "000000020005451D00000081" from archive
2020-11-11 06:38:08 UTC [21731]: [22301-1] user=,db=LOG: restored log file "000000020005451D00000082" from archive
2020-11-11 06:38:08 UTC [21731]: [22302-1] user=,db=LOG: restored log file "000000020005451D00000083" from archive
2020-11-11 06:38:08 UTC [21731]: [22303-1] user=,db=LOG: restored log file "000000020005451D00000084" from archive
2020-11-11 06:38:08 UTC [21731]: [22304-1] user=,db=LOG: restored log file "000000020005451D00000085" from archive
2020-11-11 06:38:08 UTC [21731]: [22305-1] user=,db=LOG: restored log file "000000020005451D00000086" from archive
2020-11-11 06:38:08 UTC [21731]: [22306-1] user=,db=LOG: restored log file "000000020005451D00000087" from archive
2020-11-11 06:38:08 UTC [21731]: [22307-1] user=,db=LOG: restored log file "000000020005451D00000088" from archive
2020-11-11 06:38:08 UTC [21731]: [22308-1] user=,db=LOG: restored log file "000000020005451D00000089" from archive
2020-11-11 06:38:08 UTC [21731]: [22309-1] user=,db=LOG: restored log file "000000020005451D0000008A" from archive
2020-11-11 06:38:08 UTC [21731]: [22310-1] user=,db=LOG: restored log file "000000020005451D0000008B" from archive
2020-11-11 06:38:08 UTC [21731]: [22311-1] user=,db=LOG: redo done at 5451D/8AFFE500
2020-11-11 06:38:08 UTC [21731]: [22312-1] user=,db=LOG: last completed transaction was at log time 2020-11-04 09:10:42.219935+00
2020-11-11 06:38:11 UTC [21731]: [22314-1] user=,db=FATAL: WAL ends before end of online backup
2020-11-11 06:38:11 UTC [21731]: [22315-1] user=,db=HINT: All WAL generated while online backup was taken must be available at recovery.
2020-11-11 06:38:13 UTC [21728]: [3-1] user=,db=LOG: startup process (PID 21731) exited with exit code 1
2020-11-11 06:38:13 UTC [21728]: [4-1] user=,db=LOG: terminating any other active server processes
2020-11-11 06:38:16 UTC [4559]: [1-1] user=postgres,db=postgresFATAL: the database system is in recovery mode
2020-11-11 06:38:16 UTC [4561]: [1-1] user=postgres,db=postgresFATAL: the database system is in recovery mode
2020-11-11 06:38:16 UTC [4576]: [1-1] user=postgres,db=postgresFATAL: the database system is in recovery mode
2020-11-11 06:38:25 UTC [21728]: [5-1] user=,db=LOG: database system is shut down
It looks like the WAL file is probably corrupted.
Here is the correct WAL file which restored fine:
-bash-4.2$ /usr/pgsql-9.6/bin/pg_xlogdump 000000020005451D0000008A | head -2
rmgr: Heap len (rec/tot): 151/ 151, tx: 3501354263, lsn: 5451D/8A0001D8, prev 5451D/89FFE1E8, desc: INSERT off 15, blkref #0: rel 3435996123/765803221/4171942326 blk 15806513
rmgr: Btree len (rec/tot): 72/ 72, tx: 3501354263, lsn: 5451D/8A000270, prev 5451D/8A0001D8, desc: INSERT_LEAF off 2, blkref #0: rel 3435996123/765803221/4171944289 blk 3881149
Here is the WAL file that actually stopped the archive recovery processing:
-bash-4.2$ /usr/pgsql-9.6/bin/pg_xlogdump 000000020005451D0000008B
pg_xlogdump: FATAL: could not find a valid record after 5451D/8B000000
Best Answer
The only explanation for this is that the recovery process never saw a
BACKUP_END
WAL entry, that is, it never read a WAL segment that contains the effect of apg_stop_backup
call.Now you argue convincingly that you have run the function, otherwise you wouldn't have the
backup_label
file that is generated by this function in a non-exclusive backup.Archive recovery won't allow you to skip over a WAL segment during recovery, so it is impossible that recovery skipped that segment.
That leaves a few explanations:
You used a
backup_label
file from some other backup because something got mixed up.You restored a WAL segment with the same name from a different cluster that did not contain the
BACKUP_END
entry.You got mixed up with timelines, and there was a timeline switch during the backup, so the
BACKUP_END
is actually in00000003000545210000008D
or so.(I am not sure if that is possible or if a timeline switch will break an online backup; I didn't test.)
If everything is like you expect it to be, then
00000002000545210000008D
must contain aBACKUP_END
entry. Verify that withAs soon as this entry is processed, PostgreSQL will emit the log line