Postgresql – Monitoring for and dealing with sudden jump in TOAST bloat

amazon-rdsblobdisk-spacemonitoringpostgresql

I'm trying to diagnose a sudden jump in disk usage in one of my databases (running 11.4). Typically the database grows by ~1 GB per day, but recently in the span of 24h it grew by nearly 100 GB.

Looking at various statistics, I concluded the growth came from a particular TOAST table. The table is used for a BYTEA column which stores an encrypted blob of a couple of MBs. There are ~1000 rows, but the TOAST table takes up 400 GB.

These rows are updated frequently for a few days after they're first created and then aren't touched again.

The DB is a multi-AZ RDS deployment, so my best hypothesis is that replication fell behind, causing a bunch tuples to be kept alive and making the TOAST table suddenly grow.

Is that a sensible hypothesis? How can I confirm if it's the case? What are possible workarounds to avoid this hitting us again?


Additional data:

  • The pg_toast table has ~900k live tuples
  • Dead tuples were at 5.2M this morning but at 1.7M now
  • The pg_table_size hasn't budged in a few hours for the pg_toast table, despite continued activity on the underlying table

Best Answer

If you want to check how bloated the table is, you should install the pgstattuple extension and execute pgstattuple() on the suspected table. If the table is large, as in your case, you can use pgstattuple_approx() to get an approximation.

Regarding your hypothesis, I don't think that Multi-AZ can explain your issue, as PostgreSQL Multi-AZ replication is synchronous. Network issues between master and standby would halt the activity on the master.

Other possible explanation might be:

  1. Autovacuum not being able to catch up with many deletes or updates. This can happen because of database load that is unrelated to that query. Look at your logs and CloudWatch metrics for clues
  2. If you have a regular read-replica (not Multi-AZ) then your hypothesis could explain what has happened. There is a CloudWatch metric that shows the replication lag.