PostgreSQL – Primary Node Disconnects with Automatic Failover

failover, high-availability, postgresql

OS: CentOS 7.3

PostgreSQL: 9.6

/etc/hosts:

192.168.0.1  node1
192.168.0.2  node2
192.168.0.3  # VIP (vip-master)
192.168.0.4  # VIP (vip-rep)

I followed this guide to set up PostgreSQL replication:

https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-on-postgresql-on-an-ubuntu-12-04-vps

I used the following commands to make the initial copy of the database:

# run as the postgres user on node1 (192.168.0.1)
psql -c "select pg_start_backup('initial_backup');"
rsync -cva --inplace --exclude='*pg_xlog*' ~postgres/9.6/data/ 192.168.0.2:~postgres/9.6/data/
psql -c "select pg_stop_backup();"
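
Note that because the rsync excludes pg_xlog, that directory has to be recreated (empty, owned by postgres, mode 0700) on the standby before PostgreSQL can start there. A sketch, run as root on node1:

ssh 192.168.0.2 "install -d -m 0700 -o postgres -g postgres ~postgres/9.6/data/pg_xlog"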

Then I edited pg_hba.conf and postgresql.conf with the right IPs.
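
For reference, a minimal sketch of what those two files need for 9.6 streaming replication; the rep user and the pg_archive path are taken from this setup, the other values are assumptions to adapt:

# postgresql.conf (both nodes)
listen_addresses = '*'
wal_level = replica
max_wal_senders = 5
hot_standby = on
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/9.6/pg_archive/%f'

# pg_hba.conf (both nodes); also allow whatever user the cluster agent later connects as
host  replication  rep  192.168.0.0/24  md5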
Then I created a recovery.conf file on the standby containing:

standby_mode = 'on'
primary_conninfo = 'host=192.168.0.1 port=5432 user=rep password=password'
restore_command = 'cp /var/lib/pgsql/9.6/pg_archive/%f %p'
recovery_target_timeline = 'latest'
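
Before involving Pacemaker, you can sanity-check the replication itself: pg_is_in_recovery() should return t on the standby, and the standby should show up in pg_stat_replication on the primary:

# on node2 (standby)
psql -Atc "select pg_is_in_recovery();"

# on node1 (primary)
psql -xc "select * from pg_stat_replication;"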

Then I followed the official guide:

https://wiki.clusterlabs.org/wiki/PgSQL_Replicated_Cluster

and put the following into config.pcs:

pcs cluster cib pgsql_cfg

pcs -f pgsql_cfg property set no-quorum-policy="ignore"
pcs -f pgsql_cfg property set stonith-enabled="false"
pcs -f pgsql_cfg resource defaults resource-stickiness="INFINITY"
pcs -f pgsql_cfg resource defaults migration-threshold="1"

pcs -f pgsql_cfg resource create vip-master IPaddr2 \
   ip="192.168.0.3" \
   nic="ens192" \
   cidr_netmask="24" \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="10s" on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="block"

pcs -f pgsql_cfg resource create vip-rep IPaddr2 \
   ip="192.168.0.4" \
   nic="ens192" \
   cidr_netmask="24" \
   meta migration-threshold="0" \
   op start   timeout="60s" interval="0s"  on-fail="stop" \
   op monitor timeout="60s" interval="10s" on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="ignore"

pcs -f pgsql_cfg resource create pgsql pgsql \
   pgctl="/usr/pgsql-9.6/bin/pg_ctl" \
   psql="/usr/pgsql-9.6/bin/psql" \
   pgdata="/var/lib/pgsql/9.6/data/" \
   rep_mode="sync" \
   node_list="node1 node2" \
   restore_command="cp /var/lib/pgsql/9.6/pg_archive/%f %p" \
   primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
   master_ip="192.168.0.4" \
   restart_on_promote='true' \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="4s" on-fail="restart" \
   op monitor timeout="60s" interval="3s"  on-fail="restart" role="Master" \
   op promote timeout="60s" interval="0s"  on-fail="restart" \
   op demote  timeout="60s" interval="0s"  on-fail="stop" \
   op stop    timeout="60s" interval="0s"  on-fail="block" \
   op notify  timeout="60s" interval="0s"

pcs -f pgsql_cfg resource master msPostgresql pgsql \
   master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

pcs -f pgsql_cfg resource group add master-group vip-master vip-rep

pcs -f pgsql_cfg constraint colocation add master-group with Master msPostgresql INFINITY
pcs -f pgsql_cfg constraint order promote msPostgresql then start master-group symmetrical=false score=INFINITY
pcs -f pgsql_cfg constraint order demote  msPostgresql then stop  master-group symmetrical=false score=0

pcs cluster cib-push pgsql_cfg

Load the configuration:

sh config.pcs

Check the cluster status:

crm_mon -Afr -1

Result:

Stack: corosync
Current DC: node1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Thu Jul 27 07:29:35 2017          Last change: Thu Jul 27 07:13:11 2017 by hacluster via crmd on node1

2 nodes and 4 resources configured

Online: [ node1 node2 ]

Full list of resources:

 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node2 ]
     Slaves: [ node1 ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Started node2
     vip-rep    (ocf::heartbeat:IPaddr2):       Started node2

Node Attributes:
* Node node1:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : DISCONNECT
    + pgsql-status                      : HS:alone
* Node node2:
    + master-pgsql                      : 1000
    + pgsql-data-status                 : LATEST
    + pgsql-master-baseline             : 0000000011000140
    + pgsql-status                      : PRI

Migration Summary:
* Node node1:
* Node node2:

Why can't node1 connect to the new primary?

At this point, the recovery.conf file on each node looks like this:

node1:

standby_mode = 'on'
primary_conninfo = 'host=192.168.0.4 port=5432 user=postgres application_name=node1 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'cp /var/lib/pgsql/9.6/pg_archive/%f %p'
recovery_target_timeline = 'latest'

node2:

(null)
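
One thing worth testing by hand: the agent-generated recovery.conf on node1 points at the replication VIP as user postgres, so you can try that exact connection from node1 yourself (if it fails, the cause is the network, pg_hba.conf, or authentication rather than Pacemaker):

# on node1
psql -h 192.168.0.4 -p 5432 -U postgres -c "select pg_is_in_recovery();"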

Best Answer

Whenever something goes wrong, you have to look at the logs. There you will find warning, error, fatal, and panic messages.

You can find where your logs are in your postgresql.conf file.

Look for the logging_collector setting; if it is on, you will find your server logs in the directory specified by the log_directory setting (a relative path is relative to the data directory).

If logging_collector is set to off, look at the log_destination setting. If it's syslog, you need to look at your syslog configuration to find where your logs go. If it's stderr, you might find something under /proc/<PID>/fd/2, where <PID> is the PID of your running PostgreSQL server.
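
If the server is running, you can also just ask it where it is logging (plain SHOW commands, run as the postgres user):

psql -Atc "show logging_collector;"
psql -Atc "show log_destination;"
psql -Atc "show log_directory;"
psql -Atc "show data_directory;"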

You might find this page of the documentation useful.