Postgresql 9.6 HA issues

high-availabilitypostgresqlpostgresql-9.6

(Editing the question with additional information: )

We are in process of testing a Postgres 9.6 active-passive HA cluster using streaming replication on Rhel 7.5. The HA setup is done using pacemaker and Failover is automated.

We are seeing below issue with the setup :

  1. When a failover is triggered (while data is being added to the primary) by killing primary (killall -9 postgres), standby node takes role of primary and the failed node(old primary) is restarted. However it doesn't start streaming or take stand-by role. It starts in disconnect mode.

    • We are using Postgres 9.6 because of application specific requirement – other component's dependency on 9.6 version

    • One thing we checked and noticed that the wal receiver process is not started on the new standby. If we kill postgres processes again the wal receiver process is started and standby node now syncs with new primary.

    • Another observation is that in the disconnected node, there is some additional data, which is not present on the new stand-alone standby. Which means that the failed primary was ahead in terms of log.

Need urgent inputs in :

  • Is there any way of simulating abrupt shutdown of postgres?

  • How can we prevent the situation of data mismatch with "killall -9" on Primary that is prevent old primary having additional data when postgres is killed and failover happens. We are running streaming replication in Sync mode

Thanks !

Best Answer

For questions like that, you'll have to consult the log files.

But the likely explanation is that the primary server was ahead of the standby when it crashed (this can also happen with synchronous replication). Then it cannot act as standby to the newly promoted server.

There are some solutions:

  • Easiest, but takes long: Reinitialize the server with pg_basebackup.

  • Faster, but more complicated: get the new primary in backup mode and use rsync.

  • Easiest, but requires that wal_keep_segments is high enough: pg_rewind the failed server.

You'll have to implement this in the failover scripts.

Why don't you use Patroni which does all that out of the box?