PostgreSQL – Impact of max_standby_archive_delay and max_standby_streaming_delay on Replication

amazon-rdspostgresqlreplication

In my replication servers on a postgresql master-slave replication scheme on aws RDS I get the following error:

SQLSTATE[40001]: Serialization failure: 7 ERROR:  canceling statement due to conflict with recovery

The cause, as far as I understood is that the replication happens simirary like a database migration. A sequence of queries is written something called WAL and then are being executed in a FIFO sequence.

Also, I understood that once you execute queries whilst wal is being executed it can cause conflicts because sometimes the query that is currently is being executed can cause fetching stale data.

So according to documentation there are delays that allow first the current queries to be executed and then apply wal changes. Theese are:

  1. max_standby_archive_delay
  2. max_standby_streaming_delay

But setting these values as -1 on heavy queries (query execution time >30s) will cause the replicas to have stale data for extended period of times?

Best Answer

Yes, that is the idea. In the case of a replication conflict PostgreSQL has only two options:

  1. cancel the query

  2. delay the application of replicated changes.

Setting max_standby_streaming_delay to -1 will delay replication indefinitely long.

There are ways to reduce replication conflicts:

  • Set hot_standby_feedback = on to remove replication conflicts caused by VACUUM. The price you are paying is that long running queries on the standby can bloat your tables.

  • Don't have any statements like DROP TABLE, TRUNCATE, ALTER TABLE and similar in your workload that cause ACCESS EXCLUSIVE locks.

  • In PostgreSQL v12, you can set

    ALTER TABLE atable SET (vacuum_truncate = off);
    

    to disable autovacuum truncation (which also causes a short ACCESS EXCLUSIVE lock). In older versions, there is only the crude and undocumented workaround of setting old_snapshot_threshold to something else than the default value.

This is all complicated and may have undesired effects, so the best advice is this: If you want a standby that does not lag more than necessary, but you also want to run long running queries on a standby server, you should use two standby servers, one for each of these conflicting purposes.