I have setup streaming replication for my postgres servers. I am using PostgreSQL 9.2.6. Replication seems to be working fine. I am monitoring this with two ways using nagios:
- Log_delay and
- Byte_lag
I am very frequently getting critical alert for log_delay
and in the same time byte_lag
is not throwing up any alert. log_delay
will be OK after 1 or 2 minutes. Can any one please suggest me If I am missing something in my setup?
Queries are given below.
Log_delay
SELECT
CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
END AS log_delay;
Byte_lag
SELECT sent_offset - ( replay_offset - (sent_xlog - replay_xlog) * 255 * 16 ^ 6 )
AS byte_lag
FROM (
SELECT client_addr,
('x' || lpad(split_part(sent_location, '/', 1), 8, '0'))::bit(32)::bigint
AS sent_xlog,
('x' || lpad(split_part(replay_location, '/', 1), 8, '0'))::bit(32)::bigint
AS replay_xlog,
('x' || lpad(split_part(sent_location, '/', 2), 8, '0'))::bit(32)::bigint
AS sent_offset,
('x' || lpad(split_part(replay_location, '/', 2), 8, '0'))::bit(32)::bigint
AS replay_offset
FROM pg_stat_replication ) AS s;
Best Answer
Since you're using 9.2, it would probably be better to use pg_xlog_location_diff instead of your
byte_lag
function. Something like this is much clearer, less error prone, and should give you what you are looking for.Alternatively, there is always the
check_postgres.pl
Nagios plugin. It has a huge number of implemented checks that can definitely help keep your PostgreSQL installation happy and healthy.The particular function you would want for your needs is hot_standby_delay.
Once you have it installed, and the symlinks made, you could simply do something like this from your primary:
check_postgres_hot_standby_delay --dbport=5432,5432 --dbuser=postgres,postgres --dbhost=,10.0.1.2 --warning="1048576 and 2min" --critical="16777216 and 10min" -v
And get alerts if the bytes and the time falls out of your set thresholds.