Postgresql Streaming replication – monitoring

monitoringpostgresqlpostgresql-9.2replication

I have setup streaming replication for my postgres servers. I am using PostgreSQL 9.2.6. Replication seems to be working fine. I am monitoring this with two ways using nagios:

  1. Log_delay and
  2. Byte_lag

I am very frequently getting critical alert for log_delay and in the same time byte_lag is not throwing up any alert. log_delay will be OK after 1 or 2 minutes. Can any one please suggest me If I am missing something in my setup?

Queries are given below.

Log_delay

SELECT  
    CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()  
    THEN 0  
    ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) 
    END AS log_delay; 

Byte_lag

SELECT sent_offset - ( replay_offset - (sent_xlog - replay_xlog) * 255 * 16 ^ 6 ) 
       AS byte_lag 
FROM ( 
    SELECT client_addr,
           ('x' || lpad(split_part(sent_location,   '/', 1), 8, '0'))::bit(32)::bigint 
           AS sent_xlog,
           ('x' || lpad(split_part(replay_location, '/', 1), 8, '0'))::bit(32)::bigint  
           AS replay_xlog,
           ('x' || lpad(split_part(sent_location,   '/', 2), 8, '0'))::bit(32)::bigint  
           AS sent_offset,
           ('x' || lpad(split_part(replay_location, '/', 2), 8, '0'))::bit(32)::bigint 
           AS replay_offset 
FROM pg_stat_replication ) AS s;

Best Answer

Since you're using 9.2, it would probably be better to use pg_xlog_location_diff instead of your byte_lag function. Something like this is much clearer, less error prone, and should give you what you are looking for.

SELECT client_hostname,
       client_addr, 
       pg_xlog_location_diff(sent_location, replay_location) AS difference_in_bytes
FROM pg_stat_replication;

Alternatively, there is always the check_postgres.pl Nagios plugin. It has a huge number of implemented checks that can definitely help keep your PostgreSQL installation happy and healthy.

The particular function you would want for your needs is hot_standby_delay.

Once you have it installed, and the symlinks made, you could simply do something like this from your primary:

check_postgres_hot_standby_delay --dbport=5432,5432 --dbuser=postgres,postgres --dbhost=,10.0.1.2 --warning="1048576 and 2min" --critical="16777216 and 10min" -v

And get alerts if the bytes and the time falls out of your set thresholds.