linux-5.1/Documentation/cpu-load.txt
[…]
In most cases the /proc/stat information reflects the reality quite
closely, however due to the nature of how/when the kernel collects
this data sometimes it can not be trusted at all. […]
If we imagine the system with one task that periodically burns cycles
in the following manner:

     time line between two timer interrupts
    |--------------------------------------|
     ^                                    ^
     |_ something begins working          |
                                          |_ something goes to sleep
                                         (only to be awaken quite soon)

In the above situation the system will be 0% loaded according to the
/proc/stat (since the timer interrupt will always happen when the
system is executing the idle handler), but in reality the load is
closer to 99%.
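
The sampling effect described in the quoted document can be sketched with a small simulation. This is only a toy model under stated assumptions, not how the kernel is implemented: a tick fires every `tick_us` microseconds and the whole tick is charged to whatever is running at that instant. The function name and all parameters are hypothetical.

```python
def sampled_vs_true(tick_us, period_us, phase_us, busy_us, n_ticks=10_000):
    """Compare tick-sampled load with true load for a periodic task.

    Toy model (an assumption, not kernel code): the task starts working
    phase_us after time 0, works for busy_us, then sleeps until its next
    period of length period_us. A tick fires at every multiple of tick_us
    and charges the whole tick to whatever is running at that instant.
    """
    hits = 0
    for k in range(n_ticks):
        t = k * tick_us
        # Position of this tick inside the task's current period.
        offset = (t - phase_us) % period_us
        if offset < busy_us:
            hits += 1  # the tick landed while the task was running
    return hits / n_ticks, busy_us / period_us


if __name__ == "__main__":
    # Task synchronized with the tick: it starts just after each tick and
    # goes to sleep just before the next one.
    print("synchronized:  ", sampled_vs_true(1000, 1000, 10, 980))
    # Task with a slightly different period drifts relative to the tick,
    # so sampling sees it roughly in proportion to its real run time.
    print("unsynchronized:", sampled_vs_true(1000, 997, 10, 977))
```

In this model the synchronized task is never seen by a tick even though it is busy 98% of the time, while the drifting task is sampled at close to its true load.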
This document was added in 2007. Has the CPU scheduler (e.g. the schedule() function) since been modified to measure the time every time a process transitions from runnable to waiting, provided there is a sufficiently cheap and reliable time source (a reliable TSC)?
The document includes an example program, smallhog.c. According to the linked thread on LKML.org, it was able to hog the CPU while the kernel reported only a few percent CPU usage or less.
I tried compiling and running it on my current system. The kernel reported the program's CPU usage as about 80%. So the situation appears to have changed a bit. Do we know exactly why smallhog.c is less effective on this system?
I use Fedora 30, Linux kernel v5.2.0-rc5 (approximately), running in 64-bit mode on "Intel(R) Core(TM) i5-5300U CPU".
lscpu shows constant_tsc and nonstop_tsc.
journalctl -k | grep -iE "TSC|clocksource" looks like the kernel finds no problem with the TSC.
cat /sys/devices/system/clocksource/clocksource0 shows "tsc".
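
The checks above can also be scripted. Below is a minimal sketch that looks for the same CPU flags in the text of /proc/cpuinfo and, when run directly, prints the current clocksource. The helper name `tsc_flags` is hypothetical, and I am assuming the usual sysfs file `current_clocksource` inside the clocksource0 directory.

```python
def tsc_flags(cpuinfo_text):
    """Report which TSC-related flags appear in the first 'flags' line."""
    wanted = ("tsc", "constant_tsc", "nonstop_tsc")
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present = set(value.split())
            return {name: name in present for name in wanted}
    # No 'flags' line found (e.g. a non-x86 /proc/cpuinfo layout).
    return {name: False for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(tsc_flags(f.read()))
        # Assumed path: the file inside the clocksource0 directory.
        path = "/sys/devices/system/clocksource/clocksource0/current_clocksource"
        with open(path) as f:
            print("clocksource:", f.read().strip())
    except OSError as exc:
        print("could not read system information:", exc)
```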
I see the linked thread says:

    That is not true on all architectures, some do more accurate accounting by
    recording the times at user/kernel/interrupt transitions …

    Indeed. It's certainly the way the common more boring pc architectures do it though.

(Maybe hrtick developments might have an effect on this issue? Even if only to make it more difficult to exploit. Or easier? Or just require slightly different code to exploit?)
Best Answer
You said the smallhog process shows 80% CPU time. The remaining 20% of the time on that CPU is accounted to interrupts! (See "Why does smallhog.c show less than 100% CPU usage on my system?")

smallhog is doing something very interrupt-intensive. Its specific tactic is clearly defeated by IRQ_TIME_ACCOUNTING. See below.

I suspect there is still a way to dodge the timer tick :-). You probably need a clever way to predict when the tick will fire, e.g. by looking at /proc/interrupts.

IRQ_TIME_ACCOUNTING is enabled in Fedora kernel configurations (see /boot/config-*). On x86 CPUs, it uses the TSC. The feature can be disabled with a boot-time option, tsc=noirqtime. [*]

More accurate accounting methods
As mentioned in the question, PowerPC / S390 have specific code that can account CPU time on every single context switch. This is called VIRT_CPU_ACCOUNTING_NATIVE. But your x86 kernel does not have this.

There is a generic equivalent, called VIRT_CPU_ACCOUNTING_GEN (GEN is short for "generic"). This feature is built in to your Fedora kernel, but it is not activated by default.

You have to read carefully :-). VIRT_CPU_ACCOUNTING_GEN only becomes active on "full dynticks systems". Although the Fedora kernel configuration includes NO_HZ_FULL, Fedora does not enable "full dynticks" by default. Enabling "full dynticks" requires specifying an option at boot time, nohz_full=, with a list of "adaptive-ticks CPUs". ("At least one non-adaptive-tick CPU must remain online ...")

See linux-5.2-rc5/init/Kconfig:
I marked a line through the last paragraph because it is outdated. "The full dynticks subsystem" has now been developed.
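
To check which of these accounting options apply on a given machine, something like the following sketch could be used. It only encodes my reading of the description above (built-in config options plus the tsc=noirqtime and nohz_full= boot parameters); it is not a kernel interface. All function names are hypothetical, and the command-line parsing is simplified (it ignores quoted parameters and repeated options).

```python
import os


def config_options(config_text, names):
    """Map each CONFIG_<name> to its value ('y', 'm', ...) or None if unset."""
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, _, value = line.partition("=")
            values[key[len("CONFIG_"):]] = value
    return {name: values.get(name) for name in names}


def cmdline_params(cmdline_text):
    """Split a kernel command line into {key: value}; bare words map to ''."""
    params = {}
    for word in cmdline_text.split():
        key, _, value = word.partition("=")
        params[key] = value
    return params


def accounting_summary(config_text, cmdline_text):
    """Summarize the accounting features discussed above (an interpretation)."""
    cfg = config_options(
        config_text,
        ("IRQ_TIME_ACCOUNTING", "VIRT_CPU_ACCOUNTING_GEN", "NO_HZ_FULL"),
    )
    params = cmdline_params(cmdline_text)
    return {
        # Built in, and not switched off with tsc=noirqtime.
        "irq_time_accounting": cfg["IRQ_TIME_ACCOUNTING"] == "y"
        and params.get("tsc") != "noirqtime",
        # Built in; it only becomes active on full dynticks systems.
        "virt_cpu_accounting_gen_built": cfg["VIRT_CPU_ACCOUNTING_GEN"] == "y",
        # Full dynticks needs NO_HZ_FULL plus a non-empty nohz_full= list.
        "full_dynticks_requested": cfg["NO_HZ_FULL"] == "y"
        and bool(params.get("nohz_full")),
    }


if __name__ == "__main__":
    try:
        with open("/boot/config-" + os.uname().release) as f:
            config_text = f.read()
        with open("/proc/cmdline") as f:
            cmdline_text = f.read()
    except OSError as exc:
        print("could not read kernel config or command line:", exc)
    else:
        print(accounting_summary(config_text, cmdline_text))
```

On a default Fedora boot as described above, I would expect the first two entries to be true and the last one false, since no nohz_full= list is given.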
[*] TSC considerations
If an x86 CPU does not have a TSC, the kernel does not try to use any other hardware clock source for IRQ_TIME_ACCOUNTING (or for VIRT_CPU_ACCOUNTING_GEN).

The code suggests that any available TSC is accepted. I don't know how well this works with CPUs which do not have constant_tsc :-). I am 99.9% sure, though, that the relevant maintainers were aware of that issue and would have asked why it was acceptable.

See native_sched_clock() and tsc_init():