Postgresql – Automating failover in PostgreSQL 9.1

failoverpostgresql

How does one setup two identical servers for automatic failover in PostgreSQL 9.1.

OS

Centos 5
PostgreSQL 9.1 compiled from source
The postgres user account exists on both machines and has a ssh passwordless key to connect to both machines.

My Current Setup :

Master server configuration:

postgresql.conf:

listen_address = '*'
wal_level = hot_standby
max_wal_senders = 3
checkpoint_segments = 16    
wal_keep_segments = 8 
archive_mode = on    
archive_command = 'cp "%p" /opt/pgsql91/archive/"%f"'  

pg_hba.conf:

 host  replication   all   10.0.66.1/32      trust
 host  replication   all   10.0.66.2/32      trust

Standby Server

postgresql.conf and pg_hba.conf are identical to what is configured on the master server.

recovery.conf:

 standby_mode = 'on'
 primary_conninfo = 'host=10.0.66.1'
 trigger_file = '/opt/pgsql91/data/trigger.txt'

Thanks to hzRoot, I now understand how to switch the server from standby to master.

Using the following commands, I can synchronize the new slave with the new master and then get replication backup and running.

On the new master (10.0.66.2)

  1. su – postgres
  2. touch trigger.txt in /opt/pgsql91/data/
  3. recovery.conf becomes recovery.done
  4. psql -c ";SELECT pg_start_backup('backup', true)";
  5. rsync -a -v -e ssh /opt/pgsql91/data/ 10.0.66.1:/opt/pgsql91/data/ –exclude postmaster.pid
  6. psql -c ";SELECT pg_stop_backup()";

On the new slave (10.0.66.1)

  1. create the recovery.conf : cp recovery.done to recovery.conf
  2. vi recovery.conf change ip address : primary_conninfo = 'host=10.0.66.2'
  3. start postgresql

So my questions are now :

  1. Is this the correct way to switch roles?
  2. Has anyone automated this process, if so what did you do?
  3. If synchronous replication is enabled, I noticed the new master server won't commit any transactions because it is waiting for the slave to respond. There is no slave however because the other server, the old master is down. Is this correct or do I need to temporarily disable synchronous replication while the new slave is down?

Best Answer

Check out repmrg:

repmgr is a set of open source tools that helps DBAs and System administrators manage a cluster of PostgreSQL databases..

By taking advantage of the Hot Standby capability introduced in PostgreSQL 9, repmgr greatly simplifies the process of setting up and managing database with high availability and scalability requirements.

repmgr simplifies administration and daily management, enhances productivity and reduces the overall costs of a PostgreSQL cluster by:

  • monitoring the replication process; allowing DBAs to issue high
  • availability operations such as switch-overs and fail-overs.

It does two things:

  1. repmgr: command program that performs tasks on your cluster and then exits
  2. repmgrd: management and monitoring daemon that watches the cluster and can automate remote actions.

For automatic failover, repmgrd does the trick and is not a SPOF in your network, like pgPool. However, it is still important to monitor all deamons and bring them back up after failure.

Version 2.0 is about to be released, including RPM's.