RDS Postgres oldest replication lag increases during inactivity even though Fivetran connector keeps synchronizing

amazon-rds, postgresql, postgresql-13, replication

I'm in the process of setting up a data warehouse using Fivetran as the ETL layer. One of the source databases is an AWS RDS Postgres instance.

I've configured the Postgres instance to perform logical replication of the WAL using the test_decoding plugin. All seems to work fine during office hours; however, outside office hours, when no queries are performed, the oldest replication slot lag keeps increasing even though the Fivetran connector still performs a sync every hour.
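For context, a logical decoding slot using test_decoding can be created and inspected like this (the slot name fivetran_slot is just a placeholder; Fivetran creates and manages its own slot):

    -- Placeholder slot name; Fivetran creates and manages its own slot.
    SELECT pg_create_logical_replication_slot('fivetran_slot', 'test_decoding');

    -- Peek at pending changes without consuming them.
    SELECT * FROM pg_logical_slot_peek_changes('fivetran_slot', NULL, NULL);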

You can see this in the picture below. In the red box the replication slot lag (top graph) is increasing while the synchronization moments keep happening every hour (bottom graph). I would expect a graph like the one shown in the green box, where the replication slot lag decreases around the synchronization moments.

replication_log_size_staging_vs_fivetran_log

I contacted Fivetran about this issue, but they haven't been able to identify the problem yet, so I'm asking the community.

I'm using Postgres version 13.3 with the following custom configuration:

  • max_slot_wal_keep_size: 20000
  • rds.logical_replication: 1
  • wal_sender_timeout: 0 (required by Fivetran)

The rest of the configuration is default.
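For what it's worth, the effective settings can be verified from a client session (rds.logical_replication itself is set via the RDS parameter group; with it enabled, wal_level should report logical):

    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('max_slot_wal_keep_size', 'wal_sender_timeout', 'wal_level');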

I also checked other questions; the only one that might come close is https://dba.stackexchange.com/a/103806/235086, but I'm unsure whether it applies here since it is about lag measured in seconds rather than in size.

Best Answer

I found the problem: it's due to the inactivity on the database. Since there are no changes in the database outside office hours, there are no changes to be consumed by the replica (i.e. the Fivetran connector), so the confirmed LSN of the replication slot does not advance while the server keeps generating WAL, and the reported replication lag increases.
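You can approximate what the RDS "oldest replication slot lag" metric shows with a query like the following (a sketch, run on the source database; the metric itself is computed by RDS):

    -- WAL distance between the current write position and each slot's restart point.
    SELECT slot_name,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS slot_lag
    FROM pg_replication_slots;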

It seems that others have had the same problem and came up with a solution called a 'WAL heartbeat', see 1, 2, 3, 4 and 5. This is a recurring process, e.g. a cron job, that writes a tiny amount of dummy data to the database to advance the replication slot.
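A minimal sketch of such a heartbeat, assuming the pg_cron extension is available on the instance (the table and job names below are made up):

    -- One-time setup: a tiny table to receive the dummy writes.
    CREATE TABLE IF NOT EXISTS wal_heartbeat (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        beat_time timestamptz NOT NULL DEFAULT now()
    );

    -- Schedule a small committed write every 15 minutes to keep the WAL advancing.
    SELECT cron.schedule(
        'wal-heartbeat',
        '*/15 * * * *',
        $$INSERT INTO wal_heartbeat (beat_time) VALUES (now())$$
    );

An external scheduler firing the same INSERT works just as well; the only requirement is a small committed write at a regular interval.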

Fivetran later confirmed that they've also seen this behaviour and are therefore rolling out WAL heartbeat support in Q3 2021.

Because there was no built-in WAL heartbeat support in RDS or from Fivetran (yet), I implemented my own by creating a recurring background job, along the lines of the sketch above, that creates a record every 15 minutes, e.g. at 13:15, 13:30, 13:45. This solved the problem, as you can see in the graph below. After deploying the fix I could see the expected sawtooth pattern over the weekend, when there is normally no activity.

replication_log_size_fixed

Debug information

I also want to share how I was able to debug it.

I stayed up during the evening to see what was going on when the replication lag increased. The moment the inactivity started, I could see the replication lag increasing, which confirmed the hypothesis about the inactivity.

When I inspected the restart_lsn and confirmed_flush_lsn of pg_replication_slots, they were both the same, indicating that the replication was not lagging.
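Concretely, that check looked something like this:

    SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots;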

Also, when the connector was performing its sync, the slot became active and I could see that the sent_lsn and write_lsn of pg_stat_replication were both the same, again indicating that the replication was not lagging.
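That can be observed while the slot is active with:

    -- Only returns rows while a walsender is connected, i.e. during a sync.
    SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
    FROM pg_stat_replication;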

However, the replication lag graph was still increasing, even though pg_replication_slots and pg_stat_replication didn't indicate any lag.

To debug the issue even further, I decided to create a record to get some activity going on the database; after creating the record I synced the connector manually. That's when the replication lag plummeted. See the graph below.

replication_log_size_drops_after_new_record_and_sync

Based on these observations I concluded that the replication lag is increasing because there are no changes to sync, causing the database to keep its WAL.
