Postgresql – How to identify all WALs created during start_backup and stop_backup

postgresql

I'm implementing incremental backups using filesystem-level snapshots. pg_start_backup() and pg_stop_backup() can be used to this end. Via these pair of functions or via the backup_label file, I can identify the "START WAL LOCATION" and "STOP WAL LOCATION" and associated files. The problem is, when these files are not the same and not strictly consecutive (ie, there are intervening files), how do I deterministically determine them? While it might be the case that the files are monotonically increasing, how can I be sure that is the case, and how can I be sure I've located all the intervening files? Can I be guaranteed the files will be written or changed in chronological order? Or in lexicographical order? (From what I've seen, neither is the case.)

After 3x reading the documentation, I can't get any clarity or resolution on this: it tells us that the backup process must include these files, but offers no indication on how to identify them! If I've overlooked a page in the documentation, I'd appreciate being it pointed out.

BTW: Please don't tell me to use pg_basebackup unless you want a kick in the face.

BTW2: I'm trying to make a complete backup set without doing a full backup. Barman stores the archive logs somewhere else, which is what I want to avoid. Even if I implemented barman (which I don't want to), I'd still need to know which files I need for the backup during its operation.

Best Answer

The file names are 24 hex digits long, broken into 3 fields of 8 digits each. The first is the timeline, which you probably don't need to worry about here. The other two are called the "segment" and the "file". They are really a single counter, the "file" rolls over by incrementing the "segment" by one and going back to zero.

Oddly, only the last two digits of the "file" chunk are ever non-zero. (This is apparently so that the offset into a logical 'segment' fits in a 32 bit uint, which was important at some point in the distant past)

In 9.1 and 9.2, the "file" skips FF. So from FE, it rolls back over to 00 while incrementing the segment number. In 9.3, it switches so that FF is used. So the sequence for your version would go:

0000000100000000000000FD
0000000100000000000000FE
000000010000000100000000
000000010000000100000001
...
00000001000000BE000000F9
00000001000000BE000000FA
00000001000000BE000000FB
00000001000000BE000000FC
00000001000000BE000000FD
00000001000000BE000000FE
00000001000000BF00000000
00000001000000BF00000001
00000001000000BF00000002
00000001000000BF00000003
00000001000000BF00000004

Getting the files from pg_xlog is tricky, because those files get recycled (i.e. renamed to have the name of some future file-name, then overwritten once it gets around to needing that file). This means that the timestamps on the files are all screwy. A recycled log file will have the name of some future file, but the contents and timestamp of some past file.

If a file was recycled before you had a chance to copy it, then your backup has failed and you need to try again with a higher wal_keep_segments.

And of course if you copy the "current" file before it was finalized, it will be missing the changes that occurred after the point you copied it.

It is much easier to get them from the wal archive, because that doesn't have the recycling problem and because they don't show up at all until they are finalized.