PostgreSQL – Primary Node Disconnects with Automatic Failover

failover, high-availability, postgresql

OS: CentOS 7.3

PostgreSQL: 9.6

/etc/hosts:

192.168.0.1  node1
192.168.0.2  node2
192.168.0.3  # VIP (vip-master)
192.168.0.4  # VIP (vip-rep)

I followed this guide to set up PostgreSQL replication:

https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-on-postgresql-on-an-ubuntu-12-04-vps

I used the following commands to make the initial copy of the database:

# run as the postgres user on node1 (192.168.0.1)
psql -c "select pg_start_backup('initial_backup');"
rsync -cva --inplace --exclude='*pg_xlog*' ~postgres/9.6/data/ 192.168.0.2:~postgres/9.6/data/
psql -c "select pg_stop_backup();"
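
Note that because the rsync excludes pg_xlog, that directory has to be recreated (empty, owned by postgres, mode 0700) on the standby before PostgreSQL can start there. A sketch, run as root on node1:

ssh 192.168.0.2 "install -d -m 0700 -o postgres -g postgres ~postgres/9.6/data/pg_xlog"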

Then I edited pg_hba.conf and postgresql.conf with the right IPs.
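
For reference, a minimal sketch of what those two files need for 9.6 streaming replication; the rep user and the pg_archive path are taken from this setup, the other values are assumptions to adapt:

# postgresql.conf (both nodes)
listen_addresses = '*'
wal_level = replica
max_wal_senders = 5
hot_standby = on
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/9.6/pg_archive/%f'

# pg_hba.conf (both nodes); also allow whatever user the cluster agent later connects as
host  replication  rep  192.168.0.0/24  md5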
Then I created a recovery.conf file on the standby containing:

standby_mode = 'on'
primary_conninfo = 'host=192.168.0.1 port=5432 user=rep password=password'
restore_command = 'cp /var/lib/pgsql/9.6/pg_archive/%f %p'
recovery_target_timeline = 'latest'
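
Before involving Pacemaker, you can sanity-check the replication itself: pg_is_in_recovery() should return t on the standby, and the standby should show up in pg_stat_replication on the primary:

# on node2 (standby)
psql -Atc "select pg_is_in_recovery();"

# on node1 (primary)
psql -xc "select * from pg_stat_replication;"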

Then I followed the official guide:

https://wiki.clusterlabs.org/wiki/PgSQL_Replicated_Cluster

and put the following into config.pcs:

pcs cluster cib pgsql_cfg

pcs -f pgsql_cfg property set no-quorum-policy="ignore"
pcs -f pgsql_cfg property set stonith-enabled="false"
pcs -f pgsql_cfg resource defaults resource-stickiness="INFINITY"
pcs -f pgsql_cfg resource defaults migration-threshold="1"

pcs -f pgsql_cfg resource create vip-master IPaddr2 \
   ip="192.168.0.3" \
   nic="ens192" \
   cidr_netmask="24" \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="10s" on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="block"

pcs -f pgsql_cfg resource create vip-rep IPaddr2 \
   ip="192.168.0.4" \
   nic="ens192" \
   cidr_netmask="24" \
   meta migration-threshold="0" \
   op start   timeout="60s" interval="0s"  on-fail="stop" \
   op monitor timeout="60s" interval="10s" on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="ignore"

pcs -f pgsql_cfg resource create pgsql pgsql \
   pgctl="/usr/pgsql-9.6/bin/pg_ctl" \
   psql="/usr/pgsql-9.6/bin/psql" \
   pgdata="/var/lib/pgsql/9.6/data/" \
   rep_mode="sync" \
   node_list="node1 node2" \
   restore_command="cp /var/lib/pgsql/9.6/pg_archive/%f %p" \
   primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
   master_ip="192.168.0.4" \
   restart_on_promote='true' \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="4s" on-fail="restart" \
   op monitor timeout="60s" interval="3s"  on-fail="restart" role="Master" \
   op promote timeout="60s" interval="0s"  on-fail="restart" \
   op demote  timeout="60s" interval="0s"  on-fail="stop" \
   op stop    timeout="60s" interval="0s"  on-fail="block" \
   op notify  timeout="60s" interval="0s"

pcs -f pgsql_cfg resource master msPostgresql pgsql \
   master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

pcs -f pgsql_cfg resource group add master-group vip-master vip-rep

pcs -f pgsql_cfg constraint colocation add master-group with Master msPostgresql INFINITY
pcs -f pgsql_cfg constraint order promote msPostgresql then start master-group symmetrical=false score=INFINITY
pcs -f pgsql_cfg constraint order demote  msPostgresql then stop  master-group symmetrical=false score=0

pcs cluster cib-push pgsql_cfg

Load the configuration:

sh config.pcs

Check the cluster status:

crm_mon -Afr -1

Result:

Stack: corosync
Current DC: node1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Thu Jul 27 07:29:35 2017          Last change: Thu Jul 27 07:13:11 2017 by hacluster via crmd on node1

2 nodes and 4 resources configured

Online: [ node1 node2 ]

Full list of resources:

 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node2 ]
     Slaves: [ node1 ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Started node2
     vip-rep    (ocf::heartbeat:IPaddr2):       Started node2

Node Attributes:
* Node node1:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : DISCONNECT
    + pgsql-status                      : HS:alone
* Node node2:
    + master-pgsql                      : 1000
    + pgsql-data-status                 : LATEST
    + pgsql-master-baseline             : 0000000011000140
    + pgsql-status                      : PRI

Migration Summary:
* Node node1:
* Node node2:

Why can't node1 connect to the new primary?

At this point, the recovery.conf file on each node looks like this:

node1:

standby_mode = 'on'
primary_conninfo = 'host=192.168.0.4 port=5432 user=postgres application_name=node1 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'cp /var/lib/pgsql/9.6/pg_archive/%f %p'
recovery_target_timeline = 'latest'

node2:

(null)
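
One thing worth testing by hand: the agent-generated recovery.conf on node1 points at the replication VIP as user postgres, so you can try that exact connection from node1 yourself (if it fails, the cause is the network, pg_hba.conf, or authentication rather than Pacemaker):

# on node1
psql -h 192.168.0.4 -p 5432 -U postgres -c "select pg_is_in_recovery();"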

Best Answer

Whenever something goes wrong, you have to look at the logs. There you will find warning, error, fatal, and panic messages.

You can find where your logs are in your postgresql.conf file.

Look for the logging_collector setting; if it is on, you will find your server logs in the directory specified by the log_directory setting (a relative path is relative to the data directory).

If logging_collector is set to off, look at the log_destination setting. If it's syslog, you need to look at your syslog configuration to find where your logs go. If it's stderr, you might find something under /proc/<PID>/fd/2, where <PID> is the PID of your running PostgreSQL server.
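
If the server is running, you can also just ask it where it is logging (plain SHOW commands, run as the postgres user):

psql -Atc "show logging_collector;"
psql -Atc "show log_destination;"
psql -Atc "show log_directory;"
psql -Atc "show data_directory;"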

You might find this page of the documentation useful.