100.0%sy with poor disk performance

ioperformancetop

I have RHEL 5.4 kernel 2.6.18-164.el5 and I am experiencing very bad disk performance that happens randomly during working with Oracle.

Nothing appears in the logs.

When it happens, I see one CPU core stuck at 100% system time in top:

Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  1.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  1.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8097368k total,  5761028k used,  2336340k free,   602024k buffers
Swap:  2088440k total,        0k used,  2088440k free,  3070188k cached

Checking disk writes with dd I get same on top and it shows:

time dd if=/dev/zero of=1000 bs=2M count=500 conv=fdatasync

79+0 records in
79+0 records out
165675008 bytes (166 MB) copied, 279.746 seconds, 592 kB/s

real    4m40.565s
user    0m0.000s
sys 4m40.521s 

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
8557 root      25   0 65212 2640 2548 R 99.8  0.0   3:02.99 dd

During normal operations I get:

500+0 records in
500+0 records out
1048576000 bytes (1.0 GB) copied, 9.24778 seconds, 113 MB/s

real    0m9.249s
user    0m0.001s
sys 0m1.772s

The disks are two 15k RPM SAS in a RAID1 configuration managed by a MegaRAID SAS9261-8i controller.

I have already upgraded both the driver and the firmware of the controller.

The strange problem is that the system can work fine for weeks and all disk benchmarks show good results.

How can this poor disk performance be debugged?

for Patrick request this is the output during 100%sy

top - 14:12:57 up 13 days, 15:49,  3 users,  load average: 1.28, 1.48, 1.17
Tasks: 424 total,   2 running, 422 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  1.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8097368k total,  6167440k used,  1929928k free,   306644k buffers
Swap:  2088440k total,        4k used,  2088436k free,  3638216k cached

mpstat -P ALL 3 1

02:13:19 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
02:13:22 PM  all    0.06    0.00    6.50    0.00    0.02    0.00    0.00   93.42   1077.00
02:13:22 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1001.33
02:13:22 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
02:13:22 PM    2    0.00    0.00    0.00    0.00    0.33    0.00    0.00   99.67     43.67
02:13:22 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
02:13:22 PM    4    0.00    0.00    0.00    0.33    0.00    0.00    0.00   99.67     16.33
02:13:22 PM    5    0.33    0.00    2.99    0.00    0.00    0.00    0.00   96.68      0.00
02:13:22 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.33
02:13:22 PM    7    0.33    0.00    0.00    0.00    0.00    0.00    0.00   99.67      0.00
02:13:22 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
02:13:22 PM    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
02:13:22 PM   10    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
02:13:22 PM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
02:13:22 PM   12    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
02:13:22 PM   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
02:13:22 PM   14    0.00    0.00  100.00    0.00    0.00    0.00    0.00    0.00     14.33
02:13:22 PM   15    0.66    0.00    1.00    0.00    0.00    0.00    0.00   98.34      0.00

Average:     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
Average:     all    0.06    0.00    6.50    0.00    0.02    0.00    0.00   93.42   1077.00
Average:       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1001.33
Average:       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:       2    0.00    0.00    0.00    0.00    0.33    0.00    0.00   99.67     43.67
Average:       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:       4    0.00    0.00    0.00    0.33    0.00    0.00    0.00   99.67     16.33
Average:       5    0.33    0.00    2.99    0.00    0.00    0.00    0.00   96.68      0.00
Average:       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.33
Average:       7    0.33    0.00    0.00    0.00    0.00    0.00    0.00   99.67      0.00
Average:       8    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:       9    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      10    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      11    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      12    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      13    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      14    0.00    0.00  100.00    0.00    0.00    0.00    0.00    0.00     14.33
Average:      15    0.66    0.00    1.00    0.00    0.00    0.00    0.00   98.34      0.00

sar -I XALL 3 1
interrupts higher than 0

Average:         INTR    intr/s
Average:            0   1000.33
Average:           51      7.33
Average:           59      1.00
Average:           75     23.33
Average:          218     61.67
Average:          233      0.33

it seems not interrupts issue

Related Question