Postgresql – Finding cause of high write IOPS on RDS Postgresql

amazon-rds postgresql

Our RDS PostgreSQL 9.6 instance has had an unusual episode of high write IOPS. Write IOPS went from an idle level of ~10 to over 4000 (our provisioned level) yesterday evening, and stayed at that level for 14 hours. There was no matching spike of write IOPS on our read replicas, no rise in network traffic, and no increase in read IOPS. I am looking for advice on diagnosing this, and on preparing for future incidents.

CloudWatch metrics of the incident

This morning, while write IOPS were still high at 10am, I looked for long-running processes:

SELECT * FROM pg_stat_activity ORDER BY query_start ASC;

I found two sessions that had been idle in transaction since the day before, and terminated them at 10:05. At that point no query_start was older than 2 minutes, but write IOPS remained at 4000. At 10:25 write IOPS returned to their normal level; I have no evidence that terminating the sessions caused that.
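For future incidents, something along these lines may help spot and kill long-idle transactions more quickly (a sketch only: the 10-minute threshold and the pid are placeholders; pg_terminate_backend ends the chosen backend):

-- list sessions that have been idle in transaction for a while
SELECT pid, usename, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '10 minutes'
ORDER BY xact_start;

-- terminate one offending session by pid (placeholder pid)
SELECT pg_terminate_backend(12345);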

Throughout the incident my logs (default AWS configuration) contain nothing but normal checkpoint logging:

2020-03-02 23:55:07 UTC::@:[6884]:LOG:  checkpoint starting: time
2020-03-02 23:55:31 UTC::@:[6884]:LOG:  checkpoint complete: wrote 85 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=22.426 s, sync=0.000 s, total=23.549 s; sync files=20, longest=0.000 s, average=0.000 s; distance=16419 kB, estimate=22527 kB
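For what it's worth, the checkpoint logs can be cross-checked against pg_stat_bgwriter, which at least shows whether buffers are being written by checkpoints, the background writer, or the backends themselves (a sketch; the counters are cumulative since stats_reset):

SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend,
       stats_reset
FROM pg_stat_bgwriter;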

One cause I've considered is an autovacuum freeze (anti-wraparound) run: CloudWatch shows the maximum used transaction ID reaching a peak of 199,999,946 and then dropping. But:

  • I've tested with a manual VACUUM FREEZE, and it causes near-simultaneous write IOPS on the source and the replica.
  • The writes didn't start until an hour after the drop in transaction ID.
  • The amount of data written was more than 2 TB, on an 80 GB database.
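To see which tables are closest to the anti-wraparound threshold (and so are candidates for an aggressive autovacuum), a query over pg_class along these lines can help; a sketch, with an arbitrary LIMIT:

SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;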

This doesn't cause an outage, since we're on provisioned IOPS, but I'd like to understand the cause. Could this be autovacuum, and which logging options should I change from the RDS defaults to see what's going on?

Best Answer

To log autovacuum activity you can set the log_autovacuum_min_duration parameter:

Causes each action executed by autovacuum to be logged if it ran for at least the specified number of milliseconds. Setting this to zero logs all autovacuum actions. Minus-one (the default) disables logging autovacuum actions. For example, if you set this to 250ms then all automatic vacuums and analyzes that run 250ms or longer will be logged. In addition, when this parameter is set to any value other than -1, a message will be logged if an autovacuum action is skipped due to the existence of a conflicting lock. Enabling this parameter can be helpful in tracking autovacuum activity. This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
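On RDS the instance-wide value is changed through the DB parameter group rather than postgresql.conf, and, as the quoted documentation notes, it can also be overridden per table via a storage parameter. A minimal sketch (the table name is a placeholder):

-- log every autovacuum/autoanalyze run against this table
ALTER TABLE my_big_table SET (log_autovacuum_min_duration = 0);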

Another possible cause is temporary files used by some queries for sorts and/or joins: you can check the temp_files and temp_bytes columns of the pg_stat_database view (a query is sketched at the end of this answer). You can also set the log_temp_files parameter to 0 (instead of the default -1) to log the creation of all temporary files:

Controls logging of temporary file names and sizes. Temporary files can be created for sorts, hashes, and temporary query results. A log entry is made for each temporary file when it is deleted. A value of zero logs all temporary file information, while positive values log only files whose size is greater than or equal to the specified number of kilobytes. The default setting is -1, which disables such logging. Only superusers can change this setting.
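A minimal sketch of that check, assuming you only want the current database (the counters are cumulative since the last statistics reset):

SELECT datname, temp_files,
       pg_size_pretty(temp_bytes) AS temp_space
FROM pg_stat_database
WHERE datname = current_database();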