CentOS 7 cpupower not setting the CPU’s to maximum

acpicentoscpu-frequencypower-management

On my machine, i have power savings disable on the BIOS. When I run Ubuntu on this machine, the CPU's are running maximum at 2100 MHz.

However, when I'm running CentOS 7, even with the scaling governor set to "performance", the CPU's are not running full at 2100 MHz.

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance

grep -i mhz /proc/cpuinfo
cpu MHz         : 1688.285
cpu MHz         : 2058.656
cpu MHz         : 1622.988
cpu MHz         : 2070.632

My application is a network processing application, so the CPU clock difference is resulting a performance difference.

I do not have cpuspeed,powerd, or any other power regulating services running.

So my question is, is this expected behavior given the cpupower is set to "perform"?
It seems CentOS is override the BIOS setting.

When I boot to Ubuntu, it just takes the BIOS settings and all the CPU's are running max.

Best Answer

The solution i've found is to disable intel pstate in grub

Tuning the ondemand CPU DVFS governor

The ondemand governor has a set of parameters to control when it is kicking the dynamic frequency scaling (or DVFS for dynamic voltage and frequency scaling). Those parameters are located under the sysfs tree: /sys/devices/system/cpu/cpufreq/ondemand/

One of this parameters is up_threshold which like the name suggest is a threshold (unit is % of CPU, I haven't find out though if this is per core or merged cores) above which the ondemand governor kicks in and start changing dynamically the frequency.

To change it to 50% (for example) using sudo is simple:
sudo bash -c "echo 50 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold"

If you are root, an even simpler command is possible:
echo 50 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold

Note: those changes will be lost after the next host reboot. You should add them to a configuration file that is read during boot, like /etc/init.d/rc.local on Ubuntu.

I have found out that my guest VM, although consuming a lot of CPU (80-140%) on the host was distributing the load on both cores, so no single core was above 95%, thus the CPU, to my exasperation, was staying at 800 MHz. Now with the above patch, the CPU dynamically changes it frequency per core much faster, which suits better my needs, 50% seems a better threshold for my guest usage, your mileage may vary.

Optionally, verify if you are using HPET

It is possible that some applicable which incorrectly implement timers might get affected by DVFS. This can be a problem on the host and/or guest environment, though the host can have some convoluted algorithm to try to minimise this. However, modern CPU have newer TSC (Time Stamp Counter) which are independent of the current CPU/core frequency, those are: constant (constant_tsc), invariant (invariant_tsc) or non-stop (nonstop_tsc), see this Chromium article about TSC resynchronisation for more information on each. So if your CPU is equipped with one of this TSC, you don't need to force HPET. To verify if your host CPU supports them, use a similar command (change the grep parameter to the corresponding CPU feature, here we test for the constant TSC):

$ grep constant_tsc /proc/cpuinfo

If you do not have one of this modern TSC, you should either:

Active HPET, this is described here after;
Not use CPU DVFS if you have any applications in the VM that rely on precise timing, which is the one recommended by Red Hat.

A safe solution is to enable HPET timers (see below for more details), they are slower to query than TSC ones (TSC are in the CPU, vs. HPET are in the motherboard) and perhaps not has precise (HPET >10MHz; TSC often the max CPU clock) but they are much more reliable especially in a DVFS configuration where each core could have a different frequency. Linux is clever enough to use the best available timer, it will rely on first the TSC, but if found too unreliable, it will use the HPET one. This work good on host (bare metal) systems, but due to not all information properly exported by the hypervisor, this is more of a challenge for the guest VM to detect badly behaving TSC. The trick is then to force to use HPET in the guest, although you would need the hypervisor to make this clock source available to the guests!

Below you can find how to configure and/or enable HPET on Linux and FreeBSD.

Linux HPET configuration

HPET, or high-precision event timer, is a hardware timer that you can find in most commodity PC since 2005. This timer can be used efficiently by modern OS (Linux kernel supports it since 2.6, stable support on FreeBSD since latest 9.x but was introduced in 6.3) to provide consistent timing invariably to CPU power management. It allows to build also easier tick-less scheduler implementations.

Basically HPET is like a safety barrier which even if the host has DVFS active, the host and guest timing events will be less affected.

There is a good article from IBM regarding enabling HPET, it explains how to verify which hardware timer your kernel is using, and which are available. I provide here a brief summary:

Checking the available hardware timer(s):
cat /sys/devices/system/clocksource/clocksource0/available_clocksource

Checking the current active timer:
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Simpler way to force usage of HPET if you have it available is to modify your boot loader to ask to enable it (since kernel 2.6.16). This configuration is distribution dependant, so please refer to your own distribution documentation to set it properly. You should enable hpet=enable or clocksource=hpet on the kernel boot line (this again depends on the kernel version or distribution, I did not find any coherent information).
This make sure that the guest is using the HPET timer.

Note: on my kernel 3.5, Linux seems to pick-up automatically the hpet timer.

FreeBSD guest HPET configuration

On FreeBSD one can check which timers are available by running:
sysctl kern.timecounter.choice

The currently chosen timer can be verified with:
sysctl kern.timecounter.hardware

FreeBSD 9.1 seems to automatically prefer HPET over other timer provider.

Todo: how to force HPET on FreeBSD.

Hypervisor HPET export

KVM seems to export HPET automatically when the host has support for it. However, for Linux guest they will prefer the other automatically exported clock which is kvm-clock (a paravirtualised version of the host TSC). Some people reports trouble with the preferred clock, your mileage may vary. If you want to force HPET in the guest, refer to the above section.

VirtualBox does not export the HPET clock to the guest by default, and there is no option to do so in the GUI. You need to use the command line and make sure the VM is powered off. the command is:

./VBoxManage modifyvm "VM NAME" --hpet on

If the guest keeps on selecting another source than HPET after the above change, please refer to the above section how to force the kernel to use HPET clock as a source.

Linux – What are the implications of setting the CPU governor to “performance”

For the record, the (up-to-date) cpufreq documentation is here.

What does "statically" mean?To me, it contrasts with "dynamic", and implies frequency would never change, i.e. with powersave the CPU frequency would always be a single value, equal to scaling_min_freq

You're right. Back in the old cpufreq driver days, there were two kinds of governors: dynamic ones and static ones. The difference was that dynamic governors (ondemand and conservative) could switch between CPU frequencies based on CPU utilization whereas static governors (performance and powersave) would never change the CPU frequency.
However, as you have noticed, with the new driver

this is clearly not the case.

This is because the new driver, which is called intel_pstate, operates differently. The p-states aka operation performance points involve active power management and race to idle which means scaling voltage and frequency. For more details see the official documentation.
As to your actual question,

What are the implications of setting the CPU governor to "performance" ?

it's also answered in the same document. As with all Skylake+ processors, the operating mode of your CPU is - by default - "Active Mode with HWP" so the implications of using the performance governor are (emphasize mine):

HWP + performance

In this configuration intel_pstate will write 0 to the processor’s Energy-Performance Preference (EPP) knob (if supported) or its Energy-Performance Bias (EPB) knob (otherwise), which means that the processor’s internal P-state selection logic is expected to focus entirely on performance,.

This will override the EPP/EPB setting coming from the sysfs interface (see Energy vs Performance Hints below).
Also, in this configuration the range of P-states available to the processor’s internal P-state selection logic is always restricted to the upper boundary (that is, the maximum P-state that the driver is allowed to use).

In a nutshell:
intel_pstate is actually a governor and a hardware driver all in one. It supports two policies:

the performance policy always picks the highest p-state: maximize the performance and then go back down to a virtual zero energy draw state, also called "Race to Idle"
the powersave policy attempts to balance performance with energy savings: it selects the appropriate p-state based on CPU utilization (load at this specific p-state, will probably go down when going to a higher p-state) and capacity (maximum performance in highest p-state)