Ubuntu – High CPU load, low core usage, (ECC) memory error in kernel

Tags: 15.10, cpu-load, kernel, raid, ram

I am seeing some very weird behavior …
The CPU load of my computer goes through the roof (>4 on an 8-core machine), but no process is using much CPU (see attached image). All 8 cores of the machine nevertheless show high load: htop shows each of them oscillating between 30% and 70%.

[Images: CPU load graph; top output]

This behavior appears after a random amount of time using the computer, ranging from a couple of minutes to a couple of hours.
Moreover, once it has started, the computer eventually freezes.

I am at a loss here. I had this problem on 15.04, upgraded to 15.10, and it is still the same.

The machine has these parts:
Motherboard: Asus Z10PE-D8WS
CPU: Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
RAM: 2x Kingston 16 GB PC4-2133 CL15, ECC Registered (KVR21R15D4/16)
HDD: 2x 2 TB ATA ST2000DM001-1ER1 in RAID 0

The only odd thing I found was these lines in the kernel log:

Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17386.894665] CMCI storm detected: switching to poll mode
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.299974] EDAC MC0: 4 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x1042 offset:0x100 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.299989] EDAC MC0: 4 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x85392b offset:0xa80 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.299999] EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x850da9 offset:0x580 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300009] EDAC MC0: 3 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x85f599 offset:0x100 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300018] EDAC MC0: 3 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x11b2 offset:0x780 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300022] EDAC MC0: 2 CE Error at MMIOH area, on addr 0x000000087fd43a40 on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300032] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8474e2 offset:0xf00 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300042] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8476f8 offset:0xd80 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300051] EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8466eb offset:0x500 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300060] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x846b23 offset:0x7c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300070] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x846b23 offset:0xcc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300080] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x846d32 offset:0xe40 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300089] EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x5c251b offset:0x640 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300099] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8474e3 offset:0x1c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300108] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x847711 offset:0xf40 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17387.891537] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17387.891561] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc08388000010090
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17387.891566] EDAC sbridge MC0: TSC 0 
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17387.891569] EDAC sbridge MC0: ADDR 87fc60500 EDAC sbridge MC0: MISC 14032b286 
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17387.891576] EDAC sbridge MC0: PROCESSOR 0:306f2 TIME 1455579963 SOCKET 0 APIC 0
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17388.299184] EDAC MC0: 8418 CE Error at MMIOH area, on addr 0x000000087fc60500 on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0)
Feb 15 18:51:03 XXXX-Z10PE-D8-WS kernel: [17687.707744] CMCI storm subsided: switching to interrupt mode

with these lines repeating a lot:

Feb 15 19:07:47 XXXX-Z10PE-D8-WS kernel: [18691.236569] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 15 19:07:47 XXXX-Z10PE-D8-WS kernel: [18691.236586] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00064000010090
Feb 15 19:07:47 XXXX-Z10PE-D8-WS kernel: [18691.236589] EDAC sbridge MC0: TSC 0 
Feb 15 19:07:47 XXXX-Z10PE-D8-WS kernel: [18691.236592] EDAC sbridge MC0: ADDR 103fb00 EDAC sbridge MC0: MISC 4062e286 
Feb 15 19:07:47 XXXX-Z10PE-D8-WS kernel: [18691.236597] EDAC sbridge MC0: PROCESSOR 0:306f2 TIME 1455581267 SOCKET 0 APIC 0

interspersed with some lines like these:

Feb 15 19:07:48 XXXX-Z10PE-D8-WS kernel: [18692.381405] EDAC MC0: 26415 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1042 offset:0xa00 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
Feb 15 19:07:48 XXXX-Z10PE-D8-WS kernel: [18692.381481] EDAC MC0: 4 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7c5acf offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)

Help!

Best Answer

Are you still tracking this issue? It looks like you have a bad memory module: the machine pauses while it waits for the hardware to correct the errors by itself. Try removing or replacing the memory module on your first CPU, second channel, first slot (reported as CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 in your log). Please refer to: https://serverfault.com/questions/569289/server-freezes-completely-in-unknown-condition
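
Before pulling a module, it can help to see how the corrected errors are distributed across DIMMs. Here is a minimal Python sketch that sums the "N CE memory ... error" counts per DIMM label. It assumes the kernel messages have the same shape as the ones in the question; the function name `tally_corrected_errors` and the regular expression are my own and not part of any EDAC tool. The "CE Error at MMIOH area" lines carry no DIMM label and are skipped.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log in the question:
#   EDAC MC0: 4 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 ...
# "read" may also be another single word such as "scrubbing".
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE memory \w+ error on "
    r"(?P<label>\S+) \(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)


def tally_corrected_errors(lines):
    """Sum the reported corrected-error counts per DIMM label."""
    totals = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:  # lines without a DIMM label (e.g. MMIOH) are ignored
            totals[match["label"]] += int(match["count"])
    return totals


# Three lines copied from the log in the question (timestamp prefix dropped).
sample_lines = [
    "EDAC MC0: 4 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x1042 offset:0x100 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)",
    "EDAC MC0: 4 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x85392b offset:0xa80 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:1)",
    "EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x850da9 offset:0x580 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:1)",
]

totals = tally_corrected_errors(sample_lines)
for label, count in totals.most_common():
    print(f"{count:>8}  {label}")
```

To run it on a full log, pass an open file instead of `sample_lines`, for example `tally_corrected_errors(open("/var/log/kern.log"))`. That path is the usual Ubuntu location, but check where your system writes kernel messages. If both channels show large counts, test each module on its own before deciding which one to replace.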

Hope it helps.