PostgreSQL replication slot used by barman has a long replay_lag

barman postgresql

I set up barman using WAL streaming and just noticed that I have a huge replay_lag. I'd like to keep it down to 0, but I have no idea how to do this.

I also have a database replica and it is working fine.

I have barman 2.7 and PostgreSQL 10.8.

When I check the PostgreSQL replication status, I find this:

select * from pg_stat_replication;

-[ RECORD 1 ]----+------------------------------
pid              | 23095
usesysid         | 169593
usename          | repmgr
application_name | replication_server
client_addr      | xxx.16.2.66
client_hostname  | replica_server
client_port      | 51164
backend_start    | 2019-07-05 23:03:03.194165-05
backend_xmin     |
state            | streaming
sent_lsn         | 22/30884870
write_lsn        | 22/30884870
flush_lsn        | 22/30884870
replay_lsn       | 22/30884870
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
-[ RECORD 2 ]----+------------------------------
pid              | 6689
usesysid         | 66019
usename          | streaming_barman
application_name | barman_receive_wal
client_addr      | xxx.172.16.109
client_hostname  | barman_server
client_port      | 40680
backend_start    | 2019-07-16 00:00:06.903489-05
backend_xmin     |
state            | streaming
sent_lsn         | 22/30884870
write_lsn        | 22/30884870
flush_lsn        | 22/30000000
replay_lsn       |
write_lag        | 00:00:03.01204
flush_lag        | 00:00:01.085313
replay_lag       | 583:31:21.110173
sync_priority    | 0
sync_state       | async 

select * from pg_replication_slots ;

-[ RECORD 1 ]-------+----------------
slot_name           | barman_slot
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | t
active_pid          | 6689
xmin                |
catalog_xmin        |
restart_lsn         | 22/30000000
confirmed_flush_lsn |

barman status master_server

Server master_server:
        Description: master_server - streaming
        Active: True
        Disabled: False
        PostgreSQL version: 10.8
        Cluster state: in production
        pgespresso extension: Not available
        Current data size: 11.4 GiB
        PostgreSQL Data directory: /var/lib/postgresql/10/main
        Current WAL segment: 000000010000002200000030
        PostgreSQL 'archive_command' setting: barman-wal-archive barman_server master_server %p
        Last archived WAL: 00000001000000220000002F, at Fri Aug  9 06:56:05 2019
        Failures of WAL archiver: 63 (000000010000001C00000062 at Mon Jul 15 23:59:21 2019)
        Server WAL archiving rate: 2.60/hour
        Passive node: False
        Retention policies: enforced (mode: auto, retention: RECOVERY WINDOW OF 1 MONTHS, WAL retention: MAIN)
        No. of available backups: 1
        First available backup: 20190809T063454
        Last available backup: 20190809T063454
        Minimum redundancy requirements: satisfied (1/1)

Can anyone point me to a resource that would help me figure out how to fix the "Failures of WAL archiver" and the replay_lag?

Thanks in advance

Best Answer

This large replay lag in pg_stat_replication seems to belong to a pg_receivewal process (the row with application_name barman_receive_wal). pg_receivewal writes a copy of the WAL files, but it does not apply them anywhere. Consequently, it will not report back to the primary server that WAL was applied.

This is perfectly normal; compare commit fd7d387e05.
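
Since replay_lag says nothing useful for a receiver that never replays WAL, a more meaningful check is how far write_lsn and flush_lsn trail sent_lsn. The following is only a sketch of that arithmetic in Python; the helper names lsn_to_int and lag_bytes are made up for illustration, and the LSN values are the ones from the barman_receive_wal row in the question.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '22/30884870' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits,
    a slash, then the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(sent_lsn: str, confirmed_lsn: str) -> int:
    """Bytes the primary has sent that the receiver has not yet confirmed."""
    return lsn_to_int(sent_lsn) - lsn_to_int(confirmed_lsn)


# Values copied from the barman_receive_wal row in the question.
sent_lsn = "22/30884870"
write_lsn = "22/30884870"
flush_lsn = "22/30000000"

print("write lag in bytes:", lag_bytes(sent_lsn, write_lsn))
print("flush lag in bytes:", lag_bytes(sent_lsn, flush_lsn))
```

With these values the write lag is 0 bytes and the flush lag is 8931440 bytes, which is less than one WAL segment at the default 16 MiB segment size. A flush_lsn sitting exactly on a segment boundary would be consistent with pg_receivewal flushing when it completes a segment rather than after every write, but that is an assumption about how it is being run here.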

The “Failures of WAL archiver” are also nothing you have to worry about. This number probably comes from pg_stat_archiver and indicates that there has been a problem archiving WALs in the past. This problem must have been resolved, because “Last archived WAL” indicates that more recent WAL files have been archived successfully.

PostgreSQL won't skip archiving WAL segments: if archiving fails, the archiver gets stuck at that segment and retries until it succeeds.
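
To convince yourself that the reported failure is old, you can compare the WAL file name of the last failure with the name of the last archived segment. This is a rough Python sketch, not anything barman provides; parse_wal_name and failure_is_stale are hypothetical helpers, and the comparison assumes both segments are on the same timeline (or that a higher timeline number is the newer one).

```python
from typing import NamedTuple


class WalSegment(NamedTuple):
    timeline: int
    log: int
    segment: int


def parse_wal_name(name: str) -> WalSegment:
    """Split a 24-hex-digit WAL file name into timeline, log and segment."""
    if len(name) != 24:
        raise ValueError(f"not a WAL segment name: {name!r}")
    return WalSegment(
        timeline=int(name[0:8], 16),
        log=int(name[8:16], 16),
        segment=int(name[16:24], 16),
    )


def failure_is_stale(last_failed: str, last_archived: str) -> bool:
    """True if archiving has reached or passed the segment that last failed."""
    # Tuples compare field by field: timeline, then log, then segment.
    return parse_wal_name(last_archived) >= parse_wal_name(last_failed)


# File names copied from the "barman status" output in the question.
last_failed = "000000010000001C00000062"
last_archived = "00000001000000220000002F"

print(failure_is_stale(last_failed, last_archived))
```

For the names in the question this prints True: the failed segment is in log 1C, while the last archived segment is in log 22, so the archiver has long since moved past the failure.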