Ubuntu – 16 cores are not being utilized out of 80 cores

14.04, cpu, cpu load

Recently I discovered that our server no longer utilizes all 80 threads in the system. It looks as if 16 logical cores are always idle, despite the high system load.

It's a Dell PowerEdge R900 server with 4 sockets, each holding a 10-core Xeon. That makes 40 cores, or 80 threads with HT.
(Intel(R) Xeon(R) CPU E7- 4850 @ 2.00GHz). System memory is 512 GB.
Running Ubuntu 14.04.1 LTS.
I haven't rebooted the server yet; I was hoping to avoid this.

uname -a
Linux assembly 3.13.0-35-generic #62-Ubuntu SMP Fri Aug 15 01:58:42 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I've checked the following:

Temperature measured with i7z (it cannot display all 4 sockets):

Cpu speed from cpuinfo 1994.00Mhz
True Frequency (without accounting Turbo) 1994 MHz

Socket [0] - [physical cores=10, logical cores=20, max online cores ever=10]
  CPU Multiplier 15x || Bus clock frequency (BCLK) 132.93 MHz
  TURBO ENABLED on 10 Cores, Hyper Threading ON
  Max Frequency without considering Turbo 2126.93 MHz (132.93 x [16])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 cores is  0x/0x/0x/0x/0x/0x
  Real Current Frequency 1994.02 MHz (Max of below)
        Core [core-id]  :Actual Freq (Mult.)      C0%   Halt(C1)%  C3 %   C6 %  Temp
        Core 1 [1]:       1994.01 (15.00x)       100       0       0       0    75
        Core 2 [5]:       1994.00 (15.00x)       100       0       0       0    77
        Core 3 [9]:       1994.02 (15.00x)       100       0       0       0    76
        Core 4 [13]:      1994.00 (15.00x)       100       0       0       0    77
        Core 5 [17]:      1994.00 (15.00x)       100       0       0       0    77
        Core 6 [21]:      1994.00 (15.00x)      97.7    0.404      0    1.86    77
        Core 7 [25]:      1994.00 (15.00x)      94.5       0       1    5.27    77
        Core 8 [29]:      1994.00 (15.00x)       100       0       0       0    76
        Core 9 [33]:      1994.00 (15.00x)      99.8       0       1       1    75
        Core 10 [37]:     1994.00 (15.00x)       100       0       0       0    73
  Max Frequency without considering Turbo 2126.93 MHz (132.93 x [16])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 cores is  0x/0x/0x/0x/0x/0x
  Real Current Frequency 1994.02 MHz (Max of below)
        Core [core-id]  :Actual Freq (Mult.)      C0%   Halt(C1)%  C3 %   C6 %  Temp
        Core 1 [1]:       1994.02 (15.00x)       100       0       0       0    74
        Core 2 [5]:       1994.00 (15.00x)       100       0       0       0    76
        Core 3 [9]:       1994.02 (15.00x)       100       0       0       0    76
        Core 4 [13]:      1994.00 (15.00x)       100       0       0       0    77
        Core 5 [17]:      1994.00 (15.00x)       100       0       0       0    76
        Core 6 [21]:      1994.00 (15.00x)        97       0       1    2.43    77
        Core 7 [25]:      1994.00 (15.00x)      92.9       0       1    6.81    77
C0 = Processor running without halting
C1 = Processor running with halts (States >C0 are power saver)
C3 = Cores running with PLL turned off and core cache turned off
C6 = Everything in C3 + core state saved to last level cache
  Above values in table are in percentage over the last 1 sec
[core-id] refers to core-id number in /proc/cpuinfo
'Garbage Values' message printed when garbage values are read
  Ctrl+C to exit

Idle: the last 16 cores are all 100% idle:

mpstat -P ALL 1:
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   70.69    0.00    0.70    0.00    0.00    0.00    0.00    0.00    0.00   28.61
Average:       0   92.93    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    7.07
Average:       1   94.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.00
Average:       2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:       3   83.33    0.00    2.08    0.00    0.00    0.00    0.00    0.00    0.00   14.58
Average:       4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:       5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
..........................................................
Average:      64    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      65    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      66    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      67    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      68    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      69    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      70    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      71    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      72    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      73    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      74    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      75    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      76    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      77    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      78    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      79    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
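
With 80 rows it is easy to miss a CPU, so the averages can be filtered instead of read by eye. Below is a minimal sketch, assuming the "Average:" line format shown above with %idle as the last column; idle_cpus is a hypothetical helper name, not part of sysstat.

```shell
# Print the number of every CPU whose average %idle is at or above a threshold.
# Reads "Average:" lines in the mpstat -P ALL format on stdin.
# The header row and the "all" row are skipped because their second
# column is not a plain number.
idle_cpus() {
    threshold="${1:-100}"
    awk -v t="$threshold" '
        $1 == "Average:" && $2 ~ /^[0-9]+$/ && $NF + 0 >= t + 0 { print $2 }
    '
}

# Usage (assumes sysstat is installed): sample for 5 seconds, then filter.
# mpstat -P ALL 1 5 | idle_cpus 100
```

On the server described here this would be expected to list CPUs 64 to 79, but that depends on the load at the moment of sampling.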

Top load numbers:

top - 17:41:48 up 35 days,  6:28, 15 users,  load average: 77.69, 70.48, 62.73
Tasks: 1327 total,  44 running, 1281 sleeping,   2 stopped,   0 zombie
%Cpu(s): 63.7 us, 13.6 sy,  0.0 ni, 22.3 id,  0.2 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem:  52837942+total, 52553190+used,  2847524 free,   535660 buffers
KiB Swap: 78124032 total,  2105608 used, 76018416 free. 40637328+cached Mem

Occasionally the idle percentage drops slightly below 100, as you can see here, but mostly it stays at 100%:

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   70.69    0.00    0.70    0.00    0.00    0.00    0.00    0.00    0.00   28.61
Average:      64    0.13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.87
Average:      65    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      66    0.00    0.00    2.63    0.00    0.00    0.00    0.00    0.00    0.00   97.37
Average:      67    0.00    0.00    0.13    0.13    0.00    0.00    0.00    0.00    0.00   99.75
Average:      68    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      69    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      70    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      71    0.00    0.00    0.12    0.00    0.00    0.00    0.00    0.00    0.00   99.88
Average:      72    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      73    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      74    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      75    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      76    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      77    0.00    0.00    0.13    0.00    0.00    0.00    0.00    0.00    0.00   99.87
Average:      78    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      79    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

I also ran this command to make sure they are all online:

for COUNT in `seq 1 79`;do echo 1 > /sys/devices/system/cpu/cpu${COUNT}/online;done
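
Writing 1 does not prove that a CPU actually came online, so it helps to read the state back. Below is a minimal sketch that lists every CPU whose sysfs online file reads 0; offline_cpus is a hypothetical helper name, and the directory argument exists only so the function can be pointed at a test directory.

```shell
# Print the number of every logical CPU whose sysfs "online" file reads 0.
# The first argument is the sysfs CPU directory; it defaults to the real one.
# cpu0 often has no "online" file because it cannot be hot-unplugged, so
# CPUs without the file are skipped.
offline_cpus() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*; do
        [ -r "$dir/online" ] || continue
        if [ "$(cat "$dir/online")" = "0" ]; then
            # Strip everything up to and including the final "/cpu".
            echo "${dir##*/cpu}"
        fi
    done | sort -n
}

# Usage:
# offline_cpus
# The kernel also exposes summaries of the same information:
# cat /sys/devices/system/cpu/online /sys/devices/system/cpu/offline
```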

With htop I can visualize a bar of CPU usage per thread, and I see 64 bars filled and 16 empty ones (the last 16).

When I try to start a process on a core >63, that also fails:

root@server:~# taskset -c 63 time
Usage: time [-apvV] [-f format] [-o file] [--append] [--verbose]
       [--portability] [--format=format] [--output=file] [--version]
       [--quiet] [--help] command [arg...]
root@server:~# taskset -c 64 time
taskset: failed to set pid 0's affinity: Invalid argument
root@server:~# taskset -c 65 time
taskset: failed to set pid 0's affinity: Invalid argument
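
For context, sched_setaffinity (which taskset calls) returns "Invalid argument" when the requested mask contains no CPU that is both online and permitted for the process. Below is a minimal sketch for checking a CPU number against a kernel-style CPU list such as "0-63"; cpu_in_list is a hypothetical helper name.

```shell
# Return success if CPU number $1 is contained in the kernel-style CPU list $2,
# e.g. "0-63" or "0-15,32-47". This is the format used by
# /sys/devices/system/cpu/online and by Cpus_allowed_list in /proc/<pid>/status.
cpu_in_list() {
    cpu="$1"
    old_ifs="$IFS"
    IFS=','
    # Split the list on commas into the positional parameters.
    set -- $2
    IFS="$old_ifs"
    for range in "$@"; do
        # A single number such as "5" has no dash, so both ends become "5".
        first="${range%-*}"
        last="${range#*-}"
        if [ "$cpu" -ge "$first" ] && [ "$cpu" -le "$last" ]; then
            return 0
        fi
    done
    return 1
}

# Usage:
# cpu_in_list 64 "$(cat /sys/devices/system/cpu/online)" || echo "CPU 64 is not online"
# grep Cpus_allowed_list /proc/self/status
```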

Related thread:
https://askubuntu.com/questions/536541/ubuntu-uses-only-2-out-of-4-processor-cores

EDIT:
It turns out that the cores are shut down on the fly but do not start up properly. There seem to be processes running on these unavailable cores, but it is impossible to start any new process on them. According to the dmesg log, cores are disabled and re-enabled in quick succession. I should add that shutting down these cores was intentional on our part; we have now disabled this 'feature'.
Example dmesg log:

[Mon Jan 12 12:42:40 2015] kvm: disabling virtualization on CPU79
[Mon Jan 12 12:42:40 2015] smpboot: CPU 79 is now offline
....
[Mon Jan 12 12:43:12 2015] smpboot: Booting Node 0 Processor 79 APIC 0xf3
[Mon Jan 12 12:43:12 2015] kvm: enabling virtualization on CPU79
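
To find CPUs that went offline and were never booted again, the log can be scanned for these two smpboot messages. Below is a minimal sketch that matches only the message formats quoted above (other kernel versions may word them differently); cpus_left_offline is a hypothetical helper name.

```shell
# Read kernel log lines on stdin and print each CPU whose most recent
# hotplug event was "is now offline" with no later "Booting ... Processor N".
cpus_left_offline() {
    awk '
        /smpboot: CPU [0-9]+ is now offline/ {
            # The CPU number is the field right after the word "CPU".
            for (i = 1; i <= NF; i++)
                if ($i == "CPU") { state[$(i + 1)] = "offline"; break }
        }
        /smpboot: Booting Node [0-9]+ Processor [0-9]+/ {
            # The CPU number is the field right after the word "Processor".
            for (i = 1; i <= NF; i++)
                if ($i == "Processor") { state[$(i + 1)] = "online"; break }
        }
        END {
            for (cpu in state)
                if (state[cpu] == "offline") print cpu
        }
    ' | sort -n
}

# Usage:
# dmesg | cpus_left_offline
```

The kernel ring buffer is limited in size, so on a machine with 35 days of uptime older events may already have been overwritten.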

We enable and disable cores via (writing 0 instead of 1 takes a core offline):

for COUNT in `seq 64 79`;do echo 1 > /sys/devices/system/cpu/cpu${COUNT}/online;done

We never linked these commands to our 16 unavailable cores, since normally the commands above work properly. (We also tried disabling the power manager, but this did not help.)
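
Since the loop itself gives no feedback when a core fails to come back, a variant that reads the state back after each write would have surfaced the problem earlier. Below is a minimal sketch, assuming it is run as root against /sys/devices/system/cpu; online_and_verify is a hypothetical helper name.

```shell
# Try to bring logical CPUs $2..$3 online under sysfs CPU directory $1 and
# report any whose "online" file does not read back as 1.
# Returns 0 if every CPU in the range reads back as online, 1 otherwise.
online_and_verify() {
    root="$1"; first="$2"; last="$3"
    failed=0
    n="$first"
    while [ "$n" -le "$last" ]; do
        file="$root/cpu$n/online"
        # Ignore write errors here; the read-back below is the actual check.
        { echo 1 > "$file"; } 2>/dev/null || true
        if [ "$(cat "$file" 2>/dev/null)" != "1" ]; then
            echo "cpu$n did not come online" >&2
            failed=1
        fi
        n=$((n + 1))
    done
    return "$failed"
}

# Usage (as root):
# online_and_verify /sys/devices/system/cpu 64 79
```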

Best Answer

Not all programs can use multiple threads; PHP is one example. If a single PHP process needs a lot of CPU, only one CPU will max out and the others will stay idle.
