Postgresql – How to troubleshoot hung queries in PostgreSQL RDS

amazon-rdspostgresqlquery

I have strange behavior from an Amazon RDS instance running PostgreSQL 9.6.11. This is new behavior from a system that has otherwise been behaving properly without any changes to code in the last few months. Additionally, we have a staging environment that the latest database and code changes were replicated to and it is not having these problems.

Nightly, thousands of jobs are run against the database server that consist mainly of very large SELECT queries (In the order of 8000 lines or so) that calculate statistics. Queries that typically take up to 1 minute, hang up, and do not complete, even after hours. However, if the server is rebooted to clear the queries, then I can run the same queries manually and they complete normally. The jobs will continue to process normally, and then eventually hang up again on stuck SELECT statements (usually 5 or 6, and it is not the same statements. They all vary slightly.).

I have done a lot of research and troubleshooting, but after determining that the query wasn't the problem, I'm at a loss.

The hung statements are all in active state and have been processing for hours. The server is pegged at 99%+ load.

What things should I look for? What problems could cause this issue? Why would a combination of large SELECT statements hang up? What commands can I run to get more information?

Best Answer

The most important information in addition to the connection state of the hung queries would be whether they are in any particular wait event.

You can get that information by running “SELECT state, wait_event_type, wait_event FROM pg_stat_activity” (you may need to collect this in a cron job to get the data over time)

Note that RDS also shows this information in “Performance Insights”, if you have it enabled.

Since you are on an older Postgres patch release, I would also investigate an upgrade of the minor version to see if that fixes the issue you are seeing.

Related Question