PostgreSQL – Debugging why Postgres is snowballing with system CPU usage

performance, postgresql, postgresql-9.1

I am having an issue where my Postgres database, which runs on a dedicated machine, is getting hammered with kernel-level CPU usage. This essentially takes down the web service that runs on this database.

This is a dual 6-core machine with 128 GB of RAM.

max_connections = 609
shared_buffers = 23923MB
work_mem = 78MB
maintenance_work_mem = 4784MB 
wal_buffers = 16MB
checkpoint_segments = 10
effective_cache_size = 71769MB

[screenshot: CPU usage graph]

I have hit a wall debugging this issue, trying to correlate it with any cron tasks or heavy workloads that use the database.

When I look at htop while this is occurring, I see a lot of processes like:

CPU 99-100% postgres: user database ip(pid) startup
CPU 99-100% postgres: user database ip(pid) startup
CPU 99-100% postgres: user database ip(pid) startup
CPU 99-100% postgres: user database ip(pid) startup
CPU 99-100% postgres: user database ip(pid) authentication
CPU 99-100% postgres: user database ip(pid) startup
CPU 99-100% postgres: user database ip(pid) authentication
CPU 99-100% postgres: user database ip(pid) idle
CPU 99-100% postgres: user database ip(pid) startup
CPU 99-100% postgres: user database ip(pid) startup
CPU 99-100% postgres: user database ip(pid) startup
CPU 99-100% postgres: user database ip(pid) SELECT
CPU 99-100% postgres: user database ip(pid) startup
...
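Rather than eyeballing htop, the backend states can be tallied from the process titles themselves. This is a minimal sketch using ps/awk (the last word of a `postgres:` process title is the state Postgres advertises: startup, authentication, idle, or the current query's verb); a pile of backends stuck in "startup" or "authentication" suggests connections are choking before they even run queries:

```shell
# Count postgres backends by the state shown in their process title.
# "startup"/"authentication" dominating the output means new connections
# are stalling before query execution, not that queries themselves are slow.
ps -eo args | awk '/^postgres: /{print $NF}' | sort | uniq -c | sort -rn
```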

When the issue is not occurring, pg_stat_activity is not swamped with more queries than normal, nor with queries that take a particularly long time to run. During an occurrence, every query takes tens of seconds to process, even queries that just check whether a table exists.

I would love to be pointed in a good direction to further debug this issue.

Best Answer

It has been about 24 hours since I implemented the change that fixed this, so I think it is safe to call it resolved. I will update this answer if this kind of spike happens again.

We were getting this message in the system logs, so I had my data center replace the RAM in the machine:

Mar  3 18:40:24 hubble kernel: sbridge: HANDLING MCE MEMORY ERROR
Mar  3 18:40:24 hubble kernel: CPU 6: Machine Check Exception: 0 Bank 5: 8c00004000010092
Mar  3 18:40:24 hubble kernel: TSC 0 ADDR 1087cc7780 MISC 20400e0e86 PROCESSOR 0:206d7 TIME 1393900824 SOCKET 1 APIC 20
Mar  3 18:40:24 hubble kernel: sbridge: HANDLING MCE MEMORY ERROR
Mar  3 18:40:24 hubble kernel: CPU 6: Machine Check Exception: 0 Bank 5: 8c00004000010092
Mar  3 18:40:24 hubble kernel: TSC 0 ADDR 1087cc7780 MISC 20400e0e86 PROCESSOR 0:206d7 TIME 1393900824 SOCKET 1 APIC 20
Mar  3 18:40:24 hubble kernel: EDAC MC1: CE row 9, channel 0, label "CPU_SrcID#1_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=6 Err=0001:0092 (ch=2), addr = 0x1087cc7780 => socket=1, Channel=3(mask=8), rank=0
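For anyone hitting something similar, a quick way to look for this class of hardware fault is to scan the kernel log for machine-check/EDAC messages and, where the edac driver is loaded, read the corrected-error counters it exposes under sysfs. A hedged sketch (the sysfs paths only exist when edac_core is loaded, and dmesg may require privileges on some systems):

```shell
# Scan the kernel ring buffer for machine-check / EDAC memory errors
dmesg 2>/dev/null | grep -iE 'mce memory error|machine check|edac' | tail -n 20

# Corrected-error counters per memory controller; any nonzero ce_count
# is a sign a DIMM is going bad even if nothing has crashed yet
grep -H . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null || true
```

Corrected errors do not kill the machine outright, which is exactly why symptoms can show up indirectly, as system CPU time and I/O wait, the way they did here.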

After the RAM replacement, the issue has not occurred since.

Also, the I/O wait in the CPU graph has gone away since the RAM change. Previously I had assumed it was just the profile of the machine, since I had more than a year of uptime with that I/O wait showing up in my CPU profiling.