PostgreSQL – Aurora PostgreSQL CPU usage stuck at 100% for 8 hours

aurora-postgresql

One of our Aurora PostgreSQL databases (48 cores) has had its CPU usage stuck at 100% for nearly 8 hours. This situation has happened 3 times in the past 4 months. We are on AWS's Business support plan, and over the past few months we have worked closely with the support team to diagnose the issue, but we still haven't found what causes the problem. So I'm posting the problem here; any advice is very welcome.

The Problems

Problem 1: As mentioned above, the most obvious symptom is that the database's CPU gets stuck at 100% for a very long time and then suddenly drops back to normal without any intervention, as the two screenshots below show (the second is from the database's Performance Insights panel):
[screenshot: CPU utilization graph showing CPU pinned at 100%]

[screenshot: Performance Insights panel for the same period]

Problem 2: When this incident happens, several empty tables that serve as FIFO queues in our service take an enormous amount of time (roughly 500 ms) to execute a simple query such as:

    select * from <queue_table> order by id limit 1;

Under normal circumstances a query like this takes only a few milliseconds. At the time, I checked the autovacuum activity on these tables: the autovacuum worker appeared to run normally, but every time it ran it failed to remove the tables' dead tuples. I then ran VACUUM manually, and it still could not remove the dead tuples.
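
For reference, this is roughly how the dead-tuple counts and autovacuum runs can be checked (a generic sketch; 'queue_table' stands in for our actual queue tables):

    -- dead tuples and the last (auto)vacuum run for one table
    select relname,
           n_live_tup,
           n_dead_tup,
           last_vacuum,
           last_autovacuum
    from pg_stat_user_tables
    where relname = 'queue_table';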

What we have done

The vacuum problem

The vacuum problem was relatively easy: as Laurenz kindly pointed out, the reason vacuum was not removing the dead tuples is that there were long-running queries, whose transactions kept those tuples visible and prevented cleanup.
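
A query along these lines (a sketch, not our exact monitoring query) lists the sessions whose transactions have been open the longest, which are the ones that block vacuum from cleaning up:

    -- sessions ordered by how long their transaction has been open
    select pid,
           state,
           now() - xact_start as xact_age,
           now() - query_start as query_age,
           left(query, 60) as query
    from pg_stat_activity
    where xact_start is not null
    order by xact_start
    limit 10;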

The CPU problem

After the incident occurred, we checked the database's logs and did not find any deadlocks or slow queries; also, during and before those 8 hours, the traffic to our service remained normal.

From the second image above, we can see that an UPDATE statement consumed a lot of the database's resources. AWS support suspected that this SQL was responsible for the high CPU usage, as it might have locked the table. We disagreed, for the following reasons:

  • this SQL only locks a single row (one way to verify what locks a session actually holds is sketched after this list)
  • the transaction this SQL belongs to normally takes 3 ms to 10 ms to execute; given that there was no huge traffic and the database has 48 cores, it's hard to believe this amount of traffic could keep the database stuck for such a long time
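
To check the first point, the locks held by a given backend can be inspected in pg_locks (a sketch; 12345 is a hypothetical backend pid). A single-row UPDATE should show only a RowExclusiveLock on the table plus its row-level lock, not a full table lock:

    -- locks currently held or awaited by one backend
    select locktype,
           relation::regclass as relation,
           mode,
           granted
    from pg_locks
    where pid = 12345;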

Also, the CPU became busier well before the vacuum situation occurred. The images below show the relationship between the maximum transaction ID and CPU usage (the first image uses the UTC time zone, the second uses UTC+8, to which I added the corresponding UTC times; I'll use UTC in this question):

[screenshot: maximum transaction ID and CPU usage over time]

As the graphs show, CPU usage grew steadily from 10/12 20:00 onward (the small peak around 10/12 16:00 was us running a background job):

  • from 10/12 20:00 to 10/13 00:00, CPU usage grew from 40% to nearly 100%; however, during that time autovacuum worked fine and the maximum transaction ID never exceeded 200 million
  • after 10/13 00:00, autovacuum stopped working, and from then on the CPU was stuck at 100%. CPU usage went back to normal at around 10/13 04:00 (we did not intervene at that time)
  • after 10/13 04:00, the maximum transaction ID kept growing; we now assume, as mentioned above, that this is because there were long-running transactions
  • at about 10/13 14:30, the maximum transaction ID went back to normal, after we rebooted the database
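
Besides the MaximumUsedTransactionIDs CloudWatch metric, transaction ID consumption can also be watched from inside the database; a generic sketch:

    -- age of the oldest unfrozen transaction ID per database;
    -- autovacuum must freeze tuples before this gets anywhere near 2 billion
    select datname,
           age(datfrozenxid) as xid_age
    from pg_database
    order by xid_age desc;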

What's next

After a few months of effort we still have not found the root cause; AWS's support team gave us the following advice:

  • change the database's log_statement parameter to all, so that when the incident happens again we can trace back which SQL statements were executing
  • clean up long-running queries regularly
  • avoid leaving transactions uncommitted
  • monitor the database's XID (transaction ID) consumption (a small sketch of the last three points follows this list)
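
A minimal sketch of what the last three suggestions could look like in practice (mydb is a placeholder, the 5-minute timeout is an assumed value, and on Aurora log_statement itself is normally changed through the DB cluster parameter group rather than SQL):

    -- end sessions that sit idle inside an open transaction for too long
    -- (5 minutes is an assumed value, not a recommendation from AWS)
    alter database mydb set idle_in_transaction_session_timeout = '5min';

    -- find sessions currently idle inside a transaction, oldest first
    select pid,
           usename,
           now() - xact_start as xact_age
    from pg_stat_activity
    where state = 'idle in transaction'
    order by xact_age desc;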

Best Answer

There is a long-running query that keeps your CPU busy and keeps autovacuum from cleaning up dead tuples. A long query implies a long transaction.

The queue query takes that long because of all the dead tuples it has to scan past.

Kill the long-running query with pg_terminate_backend, and things will go back to normal.
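
For example (a sketch; the one-hour threshold is an arbitrary choice):

    -- terminate every backend whose transaction has been open for more than an hour
    select pg_terminate_backend(pid)
    from pg_stat_activity
    where xact_start < now() - interval '1 hour'
      and pid <> pg_backend_pid();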

To prevent long-running queries, set statement_timeout.
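
For instance (a sketch; 30 seconds and the names mydb and app_user are assumptions, pick a limit longer than your longest legitimate query):

    -- abort any statement that runs longer than 30 seconds, database-wide
    alter database mydb set statement_timeout = '30s';

    -- or restrict it to the application role only
    alter role app_user set statement_timeout = '30s';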