Way to discover why the server had a high load

freeze, load, logs

Just now I was forced to remotely reboot my CentOS 6.3 system because of an extremely high load (75!) that paralyzed it. This is a web/mail server that serves a WordPress blog (MySQL + PHP).

Is there any log I can analyze to find out what caused it?

This is the email the system sent me about yesterday's event:

This is an automated message notifying you that the 5 minute load average on your system is 75.91.
This has exceeded the 10 threshold.

One Minute      - 83.24
Five Minutes    - 75.91
Fifteen Minutes - 39.35

top - 22:25:30 up 122 days,  7:28,  0 users,  load average: 99.14, 80.70, 42.31
Tasks: 298 total,   1 running, 297 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.2%us,  0.5%sy,  0.0%ni, 98.1%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1020176k total,   956828k used,    63348k free,     2788k buffers
Swap:  4194296k total,  1391900k used,  2802396k free,    25164k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
   1 root      20   0 19352  448  444 S  0.0  0.0   0:08.27 /sbin/init                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
   2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kthreadd]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
   3 root      RT   0     0    0    0 S  0.0  0.0   0:09.43 [migration/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
   4 root      20   0     0    0    0 S  0.0  0.0   1884:48 [ksoftirqd/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
   5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 [migration/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
   6 root      RT   0     0    0    0 S  0.0  0.0   0:06.06 [watchdog/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
   7 root      RT   0     0    0    0 S  0.0  0.0   0:07.81 [migration/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
   8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 [migration/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
   9 root      20   0     0    0    0 S  0.0  0.0   7:25.62 [ksoftirqd/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
  10 root      RT   0     0    0    0 S  0.0  0.0   0:04.58 [watchdog/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
  11 root      20   0     0    0    0 S  0.0  0.0   4:48.95 [events/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
  12 root      20   0     0    0    0 S  0.0  0.0   9:13.85 [events/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
  13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [cgroup]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  14 root      20   0     0    0    0 S  0.0  0.0   0:08.21 [khelper]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
  15 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [netns]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
  16 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [async/mgr]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
  17 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [pm]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  18 root      20   0     0    0    0 S  0.0  0.0   0:21.72 [sync_supers]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
  19 root      20   0     0    0    0 S  0.0  0.0   0:20.65 [bdi-default]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
  20 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kintegrityd/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
  21 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kintegrityd/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
  22 root      20   0     0    0    0 S  0.0  0.0   5:26.09 [kblockd/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
  23 root      20   0     0    0    0 S  0.0  0.0   0:22.90 [kblockd/1] 

I am not sure if this can help.

Everything appears to be using 0% of CPU…

This is another email:

This is an automated message notifying you that the 5 minute load average on your system is 70.53.
This has exceeded the 10 threshold.

One Minute      - 94.79
Five Minutes    - 70.53
Fifteen Minutes - 32.68

top - 22:23:34 up 122 days,  7:26,  0 users,  load average: 96.88, 74.74, 35.91
Tasks: 283 total,   2 running, 281 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.2%us,  0.5%sy,  0.0%ni, 98.1%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1020176k total,   970440k used,    49736k free,     3196k buffers
Swap:  4194296k total,  1249404k used,  2944892k free,    29836k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
6715 apache    20   0  217m 7804 3252 D  1.9  0.8   0:00.28 /usr/sbin/httpd -k start -DSSL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
6770 apache    20   0  218m 8772 3368 D  1.9  0.9   0:00.28 /usr/sbin/httpd -k start -DSSL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
6799 apache    20   0  301m 8088 3184 D  1.9  0.8   0:00.14 /usr/sbin/httpd -k start -DSSL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
7265 root      20   0 15160 1220  808 R  1.9  0.1   0:00.02 /usr/bin/top -c -b -n 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
7266 root      20   0 15160 1220  808 R  1.9  0.1   0:00.02 /usr/bin/top -c -b -n 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
   1 root      20   0 19352  448  444 S  0.0  0.0   0:08.27 /sbin/init                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
   2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kthreadd]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
   3 root      RT   0     0    0    0 S  0.0  0.0   0:09.43 [migration/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
   4 root      20   0     0    0    0 S  0.0  0.0   1884:48 [ksoftirqd/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
   5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 [migration/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
   6 root      RT   0     0    0    0 S  0.0  0.0   0:06.06 [watchdog/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
   7 root      RT   0     0    0    0 S  0.0  0.0   0:07.81 [migration/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
   8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 [migration/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
   9 root      20   0     0    0    0 S  0.0  0.0   7:22.58 [ksoftirqd/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
  10 root      RT   0     0    0    0 S  0.0  0.0   0:04.58 [watchdog/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
  11 root      20   0     0    0    0 S  0.0  0.0   4:48.95 [events/0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
  12 root      20   0     0    0    0 S  0.0  0.0   9:13.85 [events/1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
  13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [cgroup]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  14 root      20   0     0    0    0 S  0.0  0.0   0:08.21 [khelper]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
  15 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [netns]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
  16 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [async/mgr]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
  17 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [pm]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  18 root      20   0     0    0    0 S  0.0  0.0   0:21.72 [sync_supers]                       

Yes, I am using Apache. 75 is the 5-minute load average reported by top.

Best Answer

The amount of swap used suggests that swapping might be to blame. The output of vmstat would show this better during the problem scenario.

vmstat 1 30

However, neither top nor vmstat is well suited to diagnosing issues after the fact.
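If you do catch the problem while it is happening, a small filter can make swap activity stand out in the vmstat output. The following is only a sketch: swap_activity is a hypothetical helper name, and it assumes the default procps column layout in which b, si and so are fields 2, 7 and 8 (check the header line on your system).

```shell
#!/bin/bash
# Print only the vmstat samples in which pages were swapped in or out.
# Assumption: default procps layout, where b (processes blocked in
# uninterruptible sleep), si and so are fields 2, 7 and 8.
swap_activity() {
    # Header lines are skipped because their first field is not a number.
    awk '$1 ~ /^[0-9]+$/ && ($7 + $8) > 0 { print "b=" $2, "si=" $7, "so=" $8 }'
}

# Usage during an incident (not run here):
#   vmstat 1 30 | swap_activity
```

Sustained non-zero si/so values together with a large b count would be consistent with the swapping theory; an empty result would point elsewhere.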

My general advice would be to install the sysstat package. It saves system metrics periodically so that they can be retrieved later with sar. Sysstat can be configured for substantial detail, but the default configuration already gives you an initial overview of CPU usage, system load, paging and swapping.

yum install sysstat

sar 
sar -q
sar -B
sar -W
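Once sysstat has been collecting for a while, sar can also read an earlier day's data file and narrow the report to a time window. A minimal sketch follows, assuming GNU date and the usual CentOS location /var/log/sa/saDD; sa_file_for is a hypothetical helper name. Data only exists for periods during which sysstat was already installed and running.

```shell
#!/bin/bash
# Build the path of the sysstat data file for a given date.
# Assumption: data files live in /var/log/sa/saDD (DD = day of month),
# which is the usual CentOS location; adjust if yours differs.
sa_file_for() {
    # $1: any date string GNU date understands, e.g. "yesterday"
    echo "/var/log/sa/sa$(date -d "$1" +%d)"
}

# Usage (not run here): run queue and load between 22:00 and 22:30 yesterday
#   sar -q -f "$(sa_file_for yesterday)" -s 22:00:00 -e 22:30:00
```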

If this reveals little of use, you may need to look deeper, however. Something may be going on that is not immediately visible through the common performance metrics, other than the process queue (load average). One possibility is that the CPU is preoccupied with excessive interrupt requests, causing processes to queue up for what little processing time remains available to the system.

If that is the case, you may be able to find some clues in /proc/interrupts:

cat /proc/interrupts

Perhaps network adapters or the local timer show an unusually high number of interrupts?
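The counters in /proc/interrupts are totals since boot, so what matters is how fast they grow. One way to see that is to compare two snapshots taken a few seconds apart. This is a sketch only; interrupt_delta and the /tmp file names are made up for illustration.

```shell
#!/bin/bash
# Compare two snapshots of /proc/interrupts and list the interrupt
# sources whose counters grew the most in between.
interrupt_delta() {
    # $1, $2: files holding an earlier and a later snapshot
    awk '
        FNR == 1 { next }    # skip the "CPU0 CPU1 ..." header line
        {
            total = 0
            # Sum the per-CPU counters: the numeric fields after the label.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
            if (FNR == NR) before[$1] = total    # still reading the first file
            else if (total > before[$1]) print total - before[$1], $1
        }
    ' "$1" "$2" | sort -rn
}

# Usage (not run here):
#   cat /proc/interrupts > /tmp/irq.1; sleep 10
#   cat /proc/interrupts > /tmp/irq.2
#   interrupt_delta /tmp/irq.1 /tmp/irq.2 | head
```

Each output line is the increase followed by the interrupt label (for example LOC: for the local timer), largest first; the label can then be looked up in /proc/interrupts itself.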

It may come down to familiarizing yourself with perf and waiting for the next occurrence of the problem. Start recording once the trouble starts, or automate the process with a script that triggers on high load average.

perf record -a

perf report

Perf provides an incredibly detailed view of operations on the system, but also collects a lot of data and causes significant overhead, making it impractical to have it running continuously.
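Automating the recording on high load could look roughly like the sketch below. The function names, the threshold of 10, the 30-second duration and the /tmp output path are all illustrative assumptions; the recording is deliberately time-limited because of the overhead mentioned above.

```shell
#!/bin/bash
# Succeed when the 1-minute load average is above a threshold.
# $1: threshold; $2: file to read (defaults to /proc/loadavg)
load_exceeds() {
    awk -v limit="$1" '$1 > limit { found = 1 } END { exit !found }' \
        "${2:-/proc/loadavg}"
}

# Poll until the load crosses the threshold, then record system-wide
# for a fixed number of seconds. Values and paths are assumptions.
record_on_high_load() {
    local limit=${1:-10} seconds=${2:-30}
    until load_exceeds "$limit"; do sleep 5; done
    perf record -a -g -o "/tmp/perf-$(date +%Y%m%d-%H%M%S).data" -- sleep "$seconds"
}

# Usage (as root, not run here):
#   record_on_high_load 10 30
#   perf report -i /tmp/perf-<timestamp>.data
```

Left running in the background, this records once the next time the load climbs, which avoids keeping perf active continuously.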

In the case of CentOS 6.3 there is a chance that mysterious high load averages will go away with an upgrade to CentOS 6.4, which resolves a bug in the kernel source that relates to the local timer. However, your problem could just as well be caused by any given driver that may have been provided by your hardware vendor.
