Patroni – Handling Long Disconnected Replica from Primary

Tags: high-availability, patroni, postgresql, replication

Let's say I am using asynchronous streaming replication with the configuration below in a 3-node cluster running PostgreSQL 10.4 and Patroni 1.4.4:

    bootstrap:
      dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        maximum_lag_on_failover: 1048576
        postgresql:
          use_pg_rewind: true
          use_slots: true
          parameters:
            max_wal_senders: 10
            wal_keep_segments: 100
            max_replication_slots: 10

Let's assume that one of the replica nodes suddenly loses its connection to the primary for a long time.

  1. In this case, I think the WAL on the primary will keep growing because the disconnected replica's replication slot is no longer consuming it. Is there a setting in the Patroni configuration that removes the replica and its replication slot once it has been disconnected from the primary for a given duration?
  2. What is the recommended way to handle this case?

Best Answer

I would assume you are monitoring the health of your DB cluster, so a missing replica should show up very quickly. Monitoring disk space is also a must (running out of it can leave you in a situation that is not easy to recover from), and that would catch this problem too, though usually later rather than sooner.
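As a rough illustration of what such monitoring could look like, here is a minimal Python sketch. It assumes you run the query in `SLOT_LAG_QUERY` on the primary with whatever client you already use and pass the resulting rows in. `SlotStatus`, `slots_over_threshold`, and the 1 GiB threshold are names and values made up for this example; they are not part of Patroni or PostgreSQL.

```python
from typing import Iterable, List, NamedTuple, Optional

# Query to run on the primary (PostgreSQL 10 function names).
# retained_bytes is how much WAL the slot is forcing the primary to keep.
SLOT_LAG_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""


class SlotStatus(NamedTuple):
    """One row of SLOT_LAG_QUERY (hypothetical helper type)."""
    slot_name: str
    active: bool
    retained_bytes: Optional[int]  # NULL if the slot has never reserved WAL


def slots_over_threshold(rows: Iterable[SlotStatus],
                         max_retained_bytes: int) -> List[str]:
    """Return the names of slots retaining more WAL than the threshold."""
    return [
        row.slot_name
        for row in rows
        if row.retained_bytes is not None
        and row.retained_bytes > max_retained_bytes
    ]


if __name__ == "__main__":
    # Sample rows standing in for a real query result.
    sample = [
        SlotStatus("node2", False, 5 * 1024**3),
        SlotStatus("node3", True, 16 * 1024**2),
    ]
    # Illustrative threshold: alert when a slot retains more than 1 GiB.
    print(slots_over_threshold(sample, 1024**3))
```

An alert on this list, combined with a plain disk-usage alert on the WAL volume, should flag a stuck replica well before the primary runs out of space.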

Once you discover that a replica has fallen behind, you have to investigate why and fix it, or remove the host from Patroni altogether. If you are under disk space pressure, drop the replica's replication slot on the primary so that the retained WAL can be removed. In a cloud setup, simply terminating the host and letting a new one come up often solves all of this. In any case, once you have a functioning host, you might have to reinitialize the Patroni node (for example with `patronictl reinit`).
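For the slot removal step, a hedged sketch of a helper that builds the SQL to run on the primary is shown below. `pg_drop_replication_slot` is the PostgreSQL function for this, and it refuses to drop a slot that is still in use, so the helper only emits statements for inactive slots. The function name `drop_statements` and the slot names are hypothetical. Note also that, as far as I understand, with `use_slots: true` Patroni manages slots itself and may re-create a slot while the member is still registered in the DCS, so treat this as an emergency measure rather than a permanent fix.

```python
import re
from typing import Iterable, List, Tuple

# PostgreSQL slot names may contain only lowercase letters, digits and
# underscores; validating here also keeps the generated SQL safe.
_VALID_SLOT_NAME = re.compile(r"^[a-z0-9_]+$")


def drop_statements(slots: Iterable[Tuple[str, bool]]) -> List[str]:
    """Build DROP statements for inactive slots.

    `slots` is an iterable of (slot_name, active) pairs, e.g. taken from
    pg_replication_slots. Active slots are skipped because PostgreSQL
    will not drop a slot that is currently in use.
    """
    statements = []
    for name, active in slots:
        if active:
            continue
        if not _VALID_SLOT_NAME.match(name):
            raise ValueError(f"unexpected slot name: {name!r}")
        statements.append(f"SELECT pg_drop_replication_slot('{name}');")
    return statements


if __name__ == "__main__":
    # Hypothetical slot names for a 3-node cluster.
    for stmt in drop_statements([("node2", False), ("node3", True)]):
        print(stmt)
```

Review the generated statements before running them: dropping the slot of a replica that is only briefly disconnected means it may no longer find the WAL it needs and will have to be reinitialized.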

On the other hand, I'm afraid there is currently no mechanism for fencing off replicas that don't appear to be coming back, whether that is something as simple as removing the replication slot or anything more complex.