Postgresql – How does the recovery strategy of PostgreSQL exactly work

postgresqlpostgresql-11recovery

I'm testing and trying to figure out how the backup and recovery strategy in PostgreSQL 11 works but it does not work as expected. When I recover the backup, I don't get the correct state. I get any state after a given timestamp. In detail:

I have this setup in database:

wal_level = replica 
max_wal_senders = 10
archive_mode = on

I have executed the following commands:

postgres=# select pg_create_physical_replication_slot('slot1');
pg_basebackup -D /media/extern/postgresql_basebackup/
pg_receivewal -D /media/extern/postgresql_archive -S slot1 -v

pg_receivewal is writing into postgresql_archive, this seems to work.

  • then I insert something into the database, let's say at 16:21.
  • I wait some minutes, let's say until 16:27 and delete all the insert.

Then I want to recover the inserts by doing…

service postgresql stop

(do I need to stop pg_receivewal as well)?

# mv main main.before_recovery
# cp -rp /media/extern/postgresql_basebackup main

nano /var/lib/postgresql/11/main/recovery.conf

I add the following content

restore_command = 'cp -r /media/extern/postgresql_archive/%f %p'
recovery_target_time='2018-01-06 16:22:00'

Then I execute

# chown postgres:postgres recovery.conf 
# chmod 700 recovery.conf 

service postgresql start

After startup, the file changes to recovery.done, but the state in database is wrong. It didn't restore the data. The database is still empty, so the PITR didn't work, why?

What should I do In case if everything was successful after recovery?. I'm unsure about these questions:

  • Creating a new slot2 by executing pg_create_physical_replication_slot('slot2'); and streaming from this slot by pg_receivewal -D /media/extern/postgresql_archive -S slot2 -v?
  • Doing a pg_receivewal -D /media/extern/postgresql_archive -S slot1 -v again?
  • Do I have to delete anywhen any data in postgresql_archive? (-> Probably not)
  • When do I have to create another base backup?
  • When and why do I execute select pg_switch_wal(); in this recovery concept?
  • How do I deal with new 'timelines' after recovery? I understand the concept of the timelines, but I'm unsure how I should proceed with it after recovery.

[UPDATE]
jjanes pointed out, that the date is wrong. I corrected it, but I still get an empty database after recovery. I tried these steps after I recovered the database, which is now empty:

2019-01-08, 18:50: create a new base backup from this empty database
to have a good starting point.

2019-01-08, 18:55 created slot1

pg_receivewal -D /media/extern/postgresql_archive -S slot1 -v

2019-01-08, 18:57 insert

2019-01-08, 19:38 delete

service postgresql stop

# mv main main.before_recovery
# cp -rp /media/extern/postgresql_basebackup main
# chmod 700 main

# nano /var/lib/postgresql/11/main/recovery.conf

restore_command = 'cp -r /media/extern/postgresql_archive/%f %p'
recovery_target_time='2019-01-08 19:20:00'

# chown postgres:postgres recovery.conf 
# chmod 700 recovery.conf 

# service postgresql start

(btw: I saw that it starts by itself new timelimes after recovery.)

[UPDATE]
I activated in postgresql.conf

checkpoint_flush_after = 256kB 
checkpoint_timeout = 30s

I tried to restore the database by using the previous method but without recovery_target_time and without any DELETE statement. I just wanted to look if the current full state can be rebuilt at all. Additionally I renamed the current wal file 00000007000000000000002D.partial to 00000007000000000000002D after stopping the database and before recovering the data.

I'm currently not sure which of these things worked, but at least, the non-PITR recovery seem to work, because I was able to restore the latest data by installing the (empty) base backup and the wal files. At least a step into the right direction, but this only works when you don't miss any shipped wal files. Here, question 2 comes into play: What is the correct workflow to handle the "next shipping iteration".
At first, you can't start pg_receivewal -D /media/extern/postgresql_archive -S slot1 -v
again without deleting the shipped WALs in the external folder. Otherwise it would not start shipping the logs and it would abort with connection errors. This means that you can throw away your old base backup, because when you try the recover again with the old base backup, you would miss all the wal logs between base backup and last recovery point (because you deleted them). That also means that you have to create a new base backup immediately from the recovery point state. When you start log shipping again, you have no "gap" then between the new base backup and the wal files.

[UPDATE]
I got PITR running with the same approach, so I was able to "undo" a delete now. After recovery and service postgresql start, the database remains in read only mode. I had to execute select pg_wal_replay_resume(); to make it RW again. As already mentioned, I can't do WAL shipping again. When I don't modify the archive folder and try to restart the WAL to ship inside the same folder, I get the following exception:

pg_receivewal: could not send replication command "START_REPLICATION":
ERROR: requested starting point 0/34000000 on timeline 11 is not in
this server's history DETAIL: This server's history forked from
timeline 11 at 0/3310ADA8.

Sure, I could do log shipping to a new archive folder or something, but I guess this is not the professional way to do that. What is the common used strategy to resume log shipping after recovery?

Best Answer

I've done a test following these guidelines:

1) create a new base backup  
2) create slot  
3) launch pg_receivewal -D PATH -S slotname -v
4) delete records
5) postgresql stop / delete PGDATA
6) cp -rp postgresql_basebackup PGDATA
7) edit recovery.conf
   restore_command = 'cp -ir /PATH/%f %p'
   recovery_target_time=timestamp
8) postgresql start  

and PITR worked as aspected. Following what I did with comments. I have a fresh postgres installation where I created a db (test) and one table t1 whith 200 rows created with a simple loop. Contents:

postgres=# \c test
You are now connected to database "test" as user "postgres".
test=# \d t1
                    Table "public.t1"
 Column |         Type          | Collation | Nullable | Default
--------+-----------------------+-----------+----------+---------
 col1   | character varying(20) |           |          |

test=# select * from t1 limit 5;
 col1
-------
 1ABCD
 2ABCD
 3ABCD
 4ABCD
 5ABCD
(5 rows)

test=# select count(1) from t1;
count
-------
   200
(1 row)

STEP 1), 2), 3)

First start from a clean situation:

    postgres=# select now();
              now
-------------------------------
 2019-01-11 15:19:19.441022+01
(1 row)

postgres=# select * from pg_catalog.pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
(0 rows)

postgres=# select pg_switch_wal();
 pg_switch_wal
---------------
 0/B000190
(1 row)

$ cd /PGDATA-11/pg_wal/
$ ls -lrt
total 32768
drwx------ 2 postgres postgres        6 Jan 11 15:15 archive_status
-rw------- 1 postgres postgres 16777216 Jan 11 15:20 000000010000000000000001
-rw------- 1 postgres postgres 16777216 Jan 11 15:20 000000010000000000000002

Create a backup:

$ /usr/pgsql-11/bin/pg_basebackup -D /home/postgres/HOT-BACKUP/  

postgres=# select pg_create_physical_replication_slot('test1');  
pg_create_physical_replication_slot
-------------------------------------
 (test1,)
(1 row)  

$ /usr/pgsql-11/bin/pg_receivewal -D /home/postgres/BCK-PITR-TEST/ -S test1 -v -n &
[1] 23350
$ pg_receivewal: starting log streaming at 0/4000000 (timeline 1)

I started pg_receivewall with "-n" to make it stop when postgres server goes down, and put it in background.

$ ls -rlt /home/postgres/BCK-PITR-TEST/
total 16384
-rw------- 1 postgres postgres 16777216 Jan 11 15:29 000000010000000000000004.partial

STEP 4)

test=# select now();
          now
-------------------------------
 2019-01-11 15:31:24.201177+01
(1 row)

test=# select count(1) from t1 where col1 like ('1%');
 count
-------
   111
(1 row)

SLEEP two minutes.........

test=# select now();
              now
-------------------------------
 2019-01-11 15:33:24.886394+01
(1 row)

test=# delete from t1 where col1 like '1%';
DELETE 111
test=#

STEP 5)

$ /usr/pgsql-11/bin/pg_ctl stop -D /PGDATA-11/  
waiting for server to shut down....pg_receivewal: not renaming "000000010000000000000004.partial", segment is not complete
pg_receivewal: replication stream was terminated before stop point
pg_receivewal: disconnected
 done
server stopped
[1]+  Exit 1                  /usr/pgsql-11/bin/pg_receivewal -D /home/postgres/BCK-PITR-TEST/ -S test1 -v -n

Note how pg_receivewal stops itself because of parameter "-n" during start.
Let's check the contents in the most recent WAL (XXX4):

 $ /usr/pgsql-11/bin/pg_waldump 000000010000000000000004 000000010000000000000004
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/04000028, prev 0/030000F8, desc: RUNNING_XACTS nextXid 572 latestCompletedXid 571 oldestRunningXid 572
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/04000060, prev 0/04000028, desc: RUNNING_XACTS nextXid 572 latestCompletedXid 571 oldestRunningXid 572
rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 0/04000098, prev 0/04000060, desc: CHECKPOINT_ONLINE redo 0/4000060; tli 1; prev tli 1; fpw true; xid 0:572; oid 24576; multi 1; offset 0; oldest xid 561 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 572; online
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/04000108, prev 0/04000098, desc: RUNNING_XACTS nextXid 572 latestCompletedXid 571 oldestRunningXid 572
rmgr: Heap        len (rec/tot):     59/  7283, tx:        572, lsn: 0/04000140, prev 0/04000108, desc: DELETE off 1 KEYS_UPDATED , blkref #0: rel 1663/16384/16385 blk 0 FPW
rmgr: Heap        len (rec/tot):     54/    54, tx:        572, lsn: 0/04001DB8, prev 0/04000140, desc: DELETE off 10 KEYS_UPDATED , blkref #0: rel 1663/16384/16385 blk 0
rmgr: Heap        len (rec/tot):     54/    54, tx:        572, lsn: 0/04001DF0, prev 0/04001DB8, desc: DELETE off 11 KEYS_UPDATED , blkref #0: rel 1663/16384/16385 blk 0
...........................................................................................................................................................................
rmgr: Transaction len (rec/tot):     34/    34, tx:        572, lsn: 0/040035E0, prev 0/040035A8, desc: COMMIT 2019-01-11 15:33:31.342763 CET

In the last line, you can see the commit timestamp for the delete: COMMIT 2019-01-11 15:33:31.342763 CET. This is the timestamp we need to avoid to reach, in order to reject the delete statement. So the recovery.conf will be edited with a timestamp sometimes before this one.
After deleting PGDATA and restoring backup files from HOT-backup path into PGDATA, I edited recovery.conf and restarted postgres.

STEPS 6), 7), 8):

$ cp -rp /home/postgres/HOT-BACKUP/* /PGDATA-11/
$ vi recovery.conf

$ cat recovery.conf
restore_command = 'cp -ir /home/postgres/BCK-PITR-TEST/%f %p'
recovery_target_time='2019-01-11 15:31:24.201177+01'

$ /usr/pgsql-11/bin/pg_ctl start -D /PGDATA-11/ &

Look at the server log, I left only important lines:

2019-01-11 15:56:46.873 CET [28768] LOG:  database system was interrupted; last known up at 2019-01-11 15:27:52 CET
2019-01-11 15:56:46.885 CET [28768] LOG:  starting point-in-time recovery to 2019-01-11 15:31:24.201177+01
2019-01-11 15:56:46.911 CET [28768] LOG:  redo starts at 0/3000028
2019-01-11 15:56:46.913 CET [28768] LOG:  consistent recovery state reached at 0/30000F8
2019-01-11 15:56:46.914 CET [28766] LOG:  database system is ready to accept read only connections
2019-01-11 15:56:46.931 CET [28768] LOG:  restored log file "000000010000000000000004" from archive
2019-01-11 15:56:46.951 CET [28768] LOG:  recovery stopping before commit of transaction 572, time 2019-01-11 15:33:31.342763+01
2019-01-11 15:56:46.951 CET [28768] LOG:  recovery has paused
2019-01-11 15:56:46.951 CET [28768] HINT:  Execute pg_wal_replay_resume() to continue.

We can see that the recovery process reached a consistent state at lsn 0/30000F8, which is in WAl file XXX3: following WAL XXX4 has been copied but not applied because of the timestamp I configured in recovery.conf.

Of course the rows deleted are now present into the db, so I executed pg_wal_replay_resume() to bring the db RW again:

test=# select count(1) from t1;
 count
-------
    200
(1 row)

test=# select pg_wal_replay_resume();
 pg_wal_replay_resume
----------------------

(1 row)

Now wal files moved to timeline 2:

$  ls -lrt
total 49156
-rw------- 1 postgres postgres 16777216 Jan 11 15:54 000000020000000000000005
-rw------- 1 postgres postgres 16777216 Jan 11 15:56 000000010000000000000004
drwx------ 2 postgres postgres       42 Jan 11 15:56 archive_status
-rw------- 1 postgres postgres       50 Jan 11 15:59 00000002.history
-rw------- 1 postgres postgres 16777216 Jan 11 15:59 000000020000000000000004

I did a similar test using an archive_command insted of pg_receivewal and obtained identical results. I suggest you to put the db log in debug, and to check with pg_waldump what is inside WAL files. Something strange is happening to you, and it could be interesting catch out WHAT is happening.

Hope this help. MarcoP

Some answers added on 15 Jan.

What should I do In case if everything was successful after recovery?. I'm unsure about these questions:

Creating a new slot2 by executing pg_create_physical_replication_slot('slot2'); and streaming from this slot by pg_receivewal -D /media/extern/postgresql_archive -S slot2 -v?

This could be a choice. It's up to you to decide which techology best fit your requirements. We do not user pg_receivewal at all, we prefer to use an archive_command.

Doing a pg_receivewal -D /media/extern/postgresql_archive -S slot1 -v again?

No, you cant! replication slost are not included in backup by pg_basebackup. If you look at serverlog during basebackup you will see:

2019-01-14 16:32:02.886 CET sending backup "pg_basebackup base backup"     5c3cab72.27fc-12 4/0-0 [10236] DEBUG:  contents of directory "pg_replslot"   excluded from backup  

so, when you restore the PGDATA from hot-backup you will not restore any replication slot.

When do I have to create another base backup?

It really depends on your requirements. What are your constraints about Recovery Time Objective and Recovery Point Objective?

How many Wal files your db switches in a single day?

From the doc: Since you have to keep around all the archived WAL files back to your last base backup, the interval between base backups should usually be chosen based on how much storage you want to expend on archived WAL files. You should also consider how long you are prepared to spend recovering, if recovery should be necessary — the system will have to replay all those WAL segments, and that could take awhile if it has been a long time since the last base backup.

When and why do I execute select pg_switch_wal(); in this recovery concept?

This command basically force wal file to shitch to the following one. This grants that the last completed wal is moved to archive dir. In case of disaster where the live PGDATA is lost (and so the pg_wal subdir) this can be important. Wal file are 16MB by default. archive_command copies the wal to the archive destination when switch occurs.

If a disaster happens, let say a single byte before the switch, you lose 16MB of transaction activity. This is why someone prefer pg_receivewal which does not wait for wal completition. BTW you can user archive_timeout to set a value which is compatible with your requirements.