PostgreSQL – How to prevent read replica restarts during high replication lag

aws-aurora, postgresql, replication

We're running an Aurora PostgreSQL cluster with a master node and a single read replica.

Periodically there is very heavy write load, which causes high replication lag. This can cause the read replica to restart, which is undesirable for us in a high-availability environment. When this happens, clients connected to the cluster via the read-only endpoint get this JDBC error: org.postgresql.util.PSQLException: FATAL: the database system is starting up. Additionally, the AWS console shows messages like these peppered throughout the logs:

Read replica has fallen behind the master too much. Restarting postgres.

followed by

DB instance restarted

We can tolerate the read replica being behind by several minutes, but we can't tolerate the read replica restarting to catch up.

Is there a way to prevent the read replica from restarting during these periods?

Alternatively, are there any recommended tweaks for reducing replication lag during periods of heavy write load?

Best Answer

I think this is what they would call "working as designed". It is stated in the documentation for Aurora MySQL:

The tradeoff with having multiple Aurora Replicas is that replicas become unavailable for brief periods when the underlying database instances are restarted. These restarts can happen during maintenance operations, or when a replica begins to lag too far behind the master. Restarting a replica interrupts existing connections to the corresponding database instance.

I expect the same holds true for Aurora PostgreSQL, because the replication implementation is likely very similar, if not identical.

I believe the reason for this behaviour lies in the way changes are propagated from the writer instance to the read replicas: redo records are sent, and the replicas are expected to apply them to the appropriate locally cached pages in the right sequence -- see slide 27 of this presentation.

While the cache update processing should be fast, it can still overwhelm the replica, particularly if it's running on a less capable AWS instance class. Once the replica loses track of which cached pages are stale and which aren't, it has no choice but to start from scratch. I presume the original designers elected to reuse the existing instance startup process for that, rather than developing a separate cache invalidation and reload mechanism, especially given that the restart completes relatively quickly.

In other words, the answer to "is there a way to prevent the read replica from restarting?" seems to be "no".

As to "are there any recommended tweaks for reducing replication lag?": since the lag depends entirely on the replica's CPU and memory capacity (no I/O is involved), try scaling up your replica, either permanently or at least during these heavy write periods.
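
If you want to script that resize ahead of a known heavy write window, something along these lines should work with the AWS SDK for Java v2. This is only a sketch: the instance identifier and target instance class are placeholders, default credential and region resolution are assumed, and the resize itself takes the replica offline briefly.

    import software.amazon.awssdk.services.rds.RdsClient;
    import software.amazon.awssdk.services.rds.model.ModifyDbInstanceRequest;

    public class ScaleUpReplica {
        public static void main(String[] args) {
            try (RdsClient rds = RdsClient.create()) {
                // Request a larger instance class for the read replica ahead of a
                // heavy write window. applyImmediately(true) skips the maintenance
                // window; note the resize itself takes the replica offline briefly.
                rds.modifyDBInstance(ModifyDbInstanceRequest.builder()
                        .dbInstanceIdentifier("my-aurora-replica") // placeholder name
                        .dbInstanceClass("db.r5.2xlarge")          // placeholder target size
                        .applyImmediately(true)
                        .build());
            }
        }
    }

Running the same call with the smaller instance class afterwards would scale the replica back down once the heavy write period is over.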

If you have control over the client application code, you could modify it to handle broken connections more gracefully and retry when needed. If you don't, you could try setting up a proxy (e.g. pgpool) between your clients and the read replica; it might relieve the pain to some extent by proactively testing and reëstablishing connections.
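
For the retry option, here is a minimal JDBC sketch of what that could look like; the endpoint, credentials, attempt count and backoff are made-up illustrations. PostgreSQL signals "the database system is starting up" with SQLSTATE 57P03, and a connection dropped by the restart typically surfaces as an 08xxx connection error, so only those states are retried.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class ReadReplicaConnector {

        // Hypothetical read-only endpoint and credentials, for illustration only.
        private static final String READ_ONLY_URL =
                "jdbc:postgresql://my-cluster.cluster-ro-example.us-east-1.rds.amazonaws.com:5432/mydb";

        static Connection connectWithRetry(int maxAttempts, long backoffMillis)
                throws SQLException, InterruptedException {
            SQLException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return DriverManager.getConnection(READ_ONLY_URL, "app_user", "app_password");
                } catch (SQLException e) {
                    String state = e.getSQLState();
                    // 57P03 = cannot_connect_now ("the database system is starting up"),
                    // 08xxx = connection exceptions after the restart dropped the socket.
                    if (state != null && (state.equals("57P03") || state.startsWith("08"))) {
                        last = e;
                        Thread.sleep(backoffMillis * attempt); // simple linear backoff
                    } else {
                        throw e; // some other error: don't retry blindly
                    }
                }
            }
            throw last;
        }
    }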

If you find neither of these suggestions workable, you can also contact AWS Support and see if they have any better ideas.