Linux CPU Usage – Are There New Mitigations for Misleading Linux CPU Load


linux-5.1/Documentation/cpu-load.txt

[…]

In most cases the /proc/stat information reflects the reality quite
closely, however due to the nature of how/when the kernel collects
this data sometimes it can not be trusted at all.

[…]

If we imagine the system with one task that periodically burns cycles
in the following manner:
 time line between two timer interrupts
|--------------------------------------|
 ^                                    ^
 |_ something begins working          |
                                      |_ something goes to sleep
                                     (only to be awaken quite soon)
In the above situation the system will be 0% loaded according to the
/proc/stat (since the timer interrupt will always happen when the
system is executing the idle handler), but in reality the load is
closer to 99%.
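The effect described in the excerpt can be reproduced with a toy model. The following is only a sketch, not kernel code: the function names, the 4000 µs tick period (as with HZ=250) and the 20 µs margins are assumptions chosen for illustration. It compares the real busy fraction of a task with the fraction seen by a sampler that only looks at the instant of each timer tick.

```python
def true_load(busy_start, busy_end, period):
    """Real fraction of each period the task spends running."""
    return (busy_end - busy_start) / period


def sampled_load(busy_start, busy_end, period, sample_interval, n_samples):
    """Fraction of samples that catch the task running.

    The task runs during [busy_start, busy_end) of every period (all times
    in microseconds). A sample is taken every sample_interval microseconds,
    starting at time 0.
    """
    hits = 0
    for k in range(n_samples):
        phase = (k * sample_interval) % period
        if busy_start <= phase < busy_end:
            hits += 1
    return hits / n_samples


# Hypothetical numbers: the task starts 20 us after each tick and goes to
# sleep 20 us before the next one.
PERIOD_US = 4000
print(true_load(20, 3980, PERIOD_US))                      # 0.99
# Sampling exactly on the tick always finds the CPU idle.
print(sampled_load(20, 3980, PERIOD_US, PERIOD_US, 1000))  # 0.0
# A sampler that drifts relative to the task sees roughly the real load.
print(sampled_load(20, 3980, PERIOD_US, PERIOD_US + 1, 4000))
```

The last call illustrates why the problem is specific to sampling that is synchronized with the task: once the sampling instants drift across the period, the estimate converges on the real load.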

This document was added in 2007.

Have any new mitigations been added since then? For example, has the CPU scheduler (e.g. the schedule() function) been modified to record a timestamp every time a process transitions from runnable to waiting, provided there is a sufficiently cheap and reliable time source (such as a reliable TSC)?

The document includes an example program, smallhog.c. According to the linked thread on LKML.org, it was able to hog the CPU while the kernel reported only a few percent CPU usage or less.

I tried compiling and running it on my current system. The kernel reported the program's CPU usage as about 80%. So the situation appears to have changed a bit. Do we know exactly why smallhog.c is less effective on this system?

I use Fedora 30, Linux kernel v5.2.0-rc5 (approximately), running in 64-bit mode on "Intel(R) Core(TM) i5-5300U CPU".

lscpu shows constant_tsc and nonstop_tsc.
journalctl -k | grep -iE "TSC|clocksource" suggests the kernel finds no problem with the TSC.
cat /sys/devices/system/clocksource/clocksource0/current_clocksource shows "tsc".
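The same flag check can be scripted. This is a minimal sketch, assuming the x86 layout of /proc/cpuinfo, where each CPU has a "flags" line; the function name tsc_flags and the abbreviated sample text are made up for illustration.

```python
def tsc_flags(cpuinfo_text):
    """Report which TSC-related flags appear in /proc/cpuinfo-style text.

    Only the first "flags" line is inspected, on the assumption that all
    CPUs in the system report the same flags.
    """
    wanted = ("tsc", "constant_tsc", "nonstop_tsc")
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present = set(value.split())
            return {name: name in present for name in wanted}
    return {name: False for name in wanted}


# Abbreviated, made-up sample; a real flags line is much longer.
SAMPLE = """processor\t: 0
model name\t: Example CPU
flags\t\t: fpu vme tsc msr constant_tsc nonstop_tsc
"""
print(tsc_flags(SAMPLE))
```

On a real system the input would come from reading /proc/cpuinfo, for example with open("/proc/cpuinfo").read().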

I see the linked thread says

That is not true on all architectures, some do more accurate accounting by
recording the times at user/kernel/interrupt transitions …

Indeed. It's certainly the way the common more boring pc architectures do it though.

(Maybe hrtick developments have an effect on this issue? Even if only to make it more difficult to exploit. Or easier? Or to require slightly different code to exploit?)

Best Answer

You said the smallhog process shows 80% CPU time. The remaining 20% of the time on that CPU is accounted to interrupts! That is why smallhog.c shows less than 100% CPU usage on your system.
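One way to see the interrupt share is to compare two snapshots of a per-CPU line from /proc/stat. The sketch below assumes a kernel that reports at least the first eight columns (user, nice, system, idle, iowait, irq, softirq, steal); the helper names and the sample counter values are invented for illustration.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Split a /proc/stat 'cpuN ...' line into (name, {field: ticks})."""
    parts = line.split()
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return parts[0], dict(zip(FIELDS, values))


def usage_percent(before, after):
    """Percentage of elapsed ticks spent in each category between snapshots."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}


# Made-up snapshots of one CPU, taken some time apart.
_, first = parse_cpu_line("cpu1 100 0 50 1000 0 10 5 0 0 0")
_, second = parse_cpu_line("cpu1 180 0 50 1000 0 25 10 0 0 0")
shares = usage_percent(first, second)
print(shares["user"], shares["irq"] + shares["softirq"])
```

With these invented counters the user share is 80% and irq plus softirq make up the remaining 20%, which mirrors the split described above.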

smallhog is doing something very interrupt-intensive. Its specific tactic is clearly defeated by IRQ_TIME_ACCOUNTING. See below.

I suspect there is still a way to dodge the timer tick :-). You probably need a clever way to predict when the tick will fire. E.g. by looking at /proc/interrupts.

config IRQ_TIME_ACCOUNTING
    bool "Fine granularity task level IRQ time accounting"
    depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
    help
      Select this option to enable fine granularity task irq time
      accounting. This is done by reading a timestamp on each
      transitions between softirq and hardirq state, so there can be a
      small performance impact.

      If in doubt, say N here.

This feature is enabled in Fedora kernel configurations (see /boot/config-*). On x86 CPUs, it uses the TSC. The feature can be disabled with a boot-time option, tsc=noirqtime.[*]
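Checking a kernel configuration file for the relevant options can be automated. This is a small sketch that assumes the usual /boot/config-* format ("CONFIG_NAME=value" lines, with "# CONFIG_NAME is not set" for disabled options); the function name and the sample text are illustrative only.

```python
def config_options(config_text, names):
    """Look up kernel config options by name (without the CONFIG_ prefix).

    Returns {name: value}, where value is the text after '=' (for example
    'y' or 'm'), or None if the option is unset or absent.
    """
    prefix = "CONFIG_"
    values = {name: None for name in names}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.startswith(prefix) and key[len(prefix):] in values:
            values[key[len(prefix):]] = value
    return values


# Made-up excerpt in the style of a /boot/config-* file.
SAMPLE = """CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
"""
print(config_options(SAMPLE, ["IRQ_TIME_ACCOUNTING", "TICK_CPU_ACCOUNTING"]))
```

The sample values are not taken from a real Fedora kernel; the real file has to be read from /boot to see what a given kernel was built with.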

More accurate accounting methods

As mentioned in the question, PowerPC / S390 have specific code that can account CPU time on every single context switch. This is called VIRT_CPU_ACCOUNTING_NATIVE. But your x86 kernel does not have this.
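For contrast with tick sampling, accounting at every transition can be modelled as summing the time between timestamps. This is a toy illustration of the idea, not the kernel's implementation; the function name and the event list are hypothetical.

```python
def account_by_transitions(events, end_time):
    """Sum the time spent in each state from a list of transitions.

    events is a list of (timestamp, state) pairs in increasing timestamp
    order; each state lasts until the next event, and the final state
    lasts until end_time.
    """
    totals = {}
    boundaries = events[1:] + [(end_time, None)]
    for (start, state), (next_start, _) in zip(events, boundaries):
        totals[state] = totals.get(state, 0) + (next_start - start)
    return totals


# One 4000 us period: the task runs from 20 us to 3980 us.
events = [(0, "idle"), (20, "user"), (3980, "idle")]
print(account_by_transitions(events, 4000))
```

Because every transition is timestamped, the task is charged the full 3960 µs however its activity lines up with the timer tick.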

There is a generic equivalent, called VIRT_CPU_ACCOUNTING_GEN. (GEN is short for "generic"). This feature is built into your Fedora kernel, but it is not activated by default.

You have to read carefully :-). VIRT_CPU_ACCOUNTING_GEN only becomes active on "full dynticks systems". Although the Fedora kernel configuration includes NO_HZ_FULL, Fedora does not enable "full dynticks" by default. Enabling it requires specifying a boot-time option, nohz_full=, with a list of "adaptive-ticks CPUs". ("At least one non-adaptive-tick CPU must remain online ...")
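Whether nohz_full= was given at boot can be checked from the kernel command line. The sketch below assumes the simple comma-and-range CPU list form (such as "1-3,5"); richer list forms that the kernel may accept are not handled, and the function names are made up for illustration.

```python
def parse_cpu_list(spec):
    """Expand a simple CPU list such as '1-3,5' into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if not part:
            continue
        first, dash, last = part.partition("-")
        if dash:
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(first))
    return cpus


def nohz_full_cpus(cmdline):
    """Return the CPUs named by nohz_full= in a kernel command line.

    An empty set means the option is absent, i.e. no adaptive-ticks CPUs
    were requested at boot.
    """
    option = "nohz_full="
    for token in cmdline.split():
        if token.startswith(option):
            return parse_cpu_list(token[len(option):])
    return set()


# Made-up command line; on a real system read /proc/cmdline instead.
print(nohz_full_cpus("BOOT_IMAGE=/vmlinuz root=/dev/sda1 nohz_full=1-3,5 quiet"))
```

An empty result on a default Fedora boot would be consistent with the statement above that full dynticks is not enabled unless requested.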

See linux-5.2-rc5/init/Kconfig:

menu "CPU/Task time and stats accounting"

config VIRT_CPU_ACCOUNTING
    bool

choice
    prompt "Cputime accounting"
    default TICK_CPU_ACCOUNTING if !PPC64
    default VIRT_CPU_ACCOUNTING_NATIVE if PPC64

# Kind of a stub config for the pure tick based cputime accounting
config TICK_CPU_ACCOUNTING
    bool "Simple tick based cputime accounting"
    depends on !S390 && !NO_HZ_FULL
    help
      This is the basic tick based cputime accounting that maintains
      statistics about user, system and idle time spent on per jiffies
      granularity.

      If unsure, say Y.

config VIRT_CPU_ACCOUNTING_NATIVE
    bool "Deterministic task and CPU time accounting"
    depends on HAVE_VIRT_CPU_ACCOUNTING && !NO_HZ_FULL
    select VIRT_CPU_ACCOUNTING
    help
      Select this option to enable more accurate task and CPU time
      accounting.  This is done by reading a CPU counter on each
      kernel entry and exit and on transitions within the kernel
      between system, softirq and hardirq state, so there is a
      small performance impact.  In the case of s390 or IBM POWER > 5,
      this also enables accounting of stolen time on logically-partitioned
      systems.

config VIRT_CPU_ACCOUNTING_GEN
    bool "Full dynticks CPU time accounting"
    depends on HAVE_CONTEXT_TRACKING
    depends on HAVE_VIRT_CPU_ACCOUNTING_GEN
    depends on GENERIC_CLOCKEVENTS
    select VIRT_CPU_ACCOUNTING
    select CONTEXT_TRACKING
    help
      Select this option to enable task and CPU time accounting on full
      dynticks systems. This accounting is implemented by watching every
      kernel-user boundaries using the context tracking subsystem.
      The accounting is thus performed at the expense of some significant
      overhead.

      For now this is only useful if you are working on the full
      dynticks subsystem development.

      If unsure, say N.

endchoice

The last paragraph of the VIRT_CPU_ACCOUNTING_GEN help text ("For now this is only useful ...") is outdated: "the full dynticks subsystem" has now been developed.

[*] TSC considerations

If an x86 CPU does not have a TSC, the kernel does not try to use any other hardware clock source for IRQ_TIME_ACCOUNTING (or for VIRT_CPU_ACCOUNTING_GEN).

The code suggests that any available TSC is accepted. I don't know how well this works with CPUs that do not have constant_tsc :-), although I am 99.9% sure the relevant maintainers were aware of that issue and would have asked why it was acceptable.

See native_sched_clock() and tsc_init():

/*
 * Fall back to jiffies if there's no TSC available:
 * ( But note that we still use it if the TSC is marked
 *   unstable. We do this because unlike Time Of Day,
 *   the scheduler clock tolerates small errors and it's
 *   very important for it to be as fast as the platform
 *   can achieve it. )
 */
