Content notice: this post includes links to various Linux discussion and code. Some linked content does not meet the current Code of Conduct for StackExchange or for Linux. Mostly they "insult the code [but not the person]". However, some language is used that should simply not be repeated. I ask you to avoid imitating, parroting, or debating such language.
Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low
On 05/07/2019 12:38, Peter Zijlstra wrote:
On Fri, Jul 05, 2019 at 12:25:46PM +0100, Alan Jenkins wrote:
My cpu "iowait" time appears to be reported incorrectly. Do you know why
this could happen?
Because iowait is a magic random number that has no sane meaning.
Personally I'd prefer to just delete the whole thing, except ABI :/
Also see the comment near nr_iowait()
Thanks. I take [the problems mentioned in current documentation] as being different problems, but you mean there is not much demand (or point) to "fix" my issue.
I found my problem. It was already noticed five years ago, and it would not be trivial to fix.
"iowait" time is updated by the function account_idle_time()
:
/*
* Account for idle time.
* @cputime: the CPU time spent in idle wait
*/
void account_idle_time(u64 cputime)
{
u64 *cpustat = kcpustat_this_cpu->cpustat;
struct rq *rq = this_rq();
if (atomic_read(&rq->nr_iowait) > 0)
cpustat[CPUTIME_IOWAIT] += cputime;
else
cpustat[CPUTIME_IDLE] += cputime;
}
This works as I expected, if you approximate CPU time by "sampling" it with the traditional timer interrupt ("tick"). However, it may not work if the tick is turned off during idle time to save power - NO_HZ_IDLE. It may also fail if you allow the tick to be turned off for performance reasons - NO_HZ_FULL - because that requires starting VIRT_CPU_ACCOUNTING. Most Linux kernels use the power-saving feature. Some embedded systems do not use either feature. Here is my explanation:
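To see which of these options your own kernel was built with, you can grep its config, if your distribution ships it. This is a sketch, not guaranteed to work on every system: the config file path varies between distributions.

```shell
# Look up the tick / accounting options in the running kernel's config.
# The config may live in /boot/config-$(uname -r) or /proc/config.gz,
# depending on the distribution; some ship neither.
cfg=/boot/config-$(uname -r)
if [ -r "$cfg" ]; then
    grep -E 'CONFIG_(NO_HZ_IDLE|NO_HZ_FULL|VIRT_CPU_ACCOUNTING)' "$cfg"
elif [ -r /proc/config.gz ]; then
    zcat /proc/config.gz | grep -E 'CONFIG_(NO_HZ_IDLE|NO_HZ_FULL|VIRT_CPU_ACCOUNTING)'
else
    echo "kernel config not available"
fi
```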
When the IO is complete, the device sends an interrupt. The kernel interrupt handler wakes the process using try_to_wake_up(). It subtracts one from the nr_iowait counter:

```c
	if (p->in_iowait) {
		delayacct_blkio_end(p);
		atomic_dec(&task_rq(p)->nr_iowait);
	}
```
If the process is woken on an idle CPU, that CPU calls account_idle_time(). Depending on which configuration applies, this is called either from tick_nohz_account_idle_ticks() from __tick_nohz_idle_restart_tick(), or from vtime_task_switch() from finish_task_switch().
By this time, ->nr_iowait has already been decremented. If it has been reduced to zero, then no iowait time will be recorded.
This effect can vary: it depends which CPU the process is woken on. If the process is woken on the same CPU that received the IO completion interrupt, the idle time can be accounted earlier, before ->nr_iowait is decremented. In my case, I found that CPU 0 handles the ahci interrupt, by looking at watch cat /proc/interrupts.
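A quick way to see how a device's interrupts are distributed across CPUs (the device name "ahci" here is from my system; substitute the name from your own /proc/interrupts):

```shell
dev=ahci    # device name to look for; taken from my system, substitute yours
head -n 1 /proc/interrupts             # header line: one column per CPU
grep "$dev" /proc/interrupts || echo "no interrupt lines match \"$dev\""
```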
I tested this with a simple sequential read:

```shell
dd if=largefile iflag=direct bs=1M of=/dev/null
```

If I pin the command to CPU 0 using taskset -c 0 ..., I see "correct" values for iowait. If I pin it to a different CPU, I see much lower values. If I run the command normally, the result varies with scheduler behaviour, which has changed between kernel versions. On recent kernels (4.17, 5.1, 5.2-rc5-ish), the command seems to spend about 1/4 of its time on CPU 0, judging from the fact that the reported "iowait" time is reduced to about that fraction.
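The iowait values in a test like this can be read straight from /proc/stat. Here is a minimal sketch of comparing before/after samples for one CPU; the workload line is the dd command from above, commented out so the skeleton runs anywhere:

```shell
# Print the iowait ticks for cpuN from /proc/stat.
read_iowait() { awk -v c="cpu$1" '$1 == c { print $6 }' /proc/stat; }

before=$(read_iowait 0)
# Run the workload here, pinned to CPU 0, e.g.:
# taskset -c 0 dd if=largefile iflag=direct bs=1M of=/dev/null
sleep 1
after=$(read_iowait 0)
echo "cpu0 iowait delta: $((after - before)) ticks"
```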
(Not explained: I have not confirmed exactly why suppressing NO_HZ_IDLE gives "correct" iowait for each CPU on 4.17+, but not on 4.16 or 4.15.

Running this test on my virtual machine seems to reproduce "correct" iowait, for each (or any) CPU. This is due to IRQ_TIME_ACCOUNTING. That feature is also used in my tests outside the VM, but I get many more interrupts when testing inside the VM. Specifically, there are more than 1000 "Function call interrupts" per second on the virtual CPU that "dd" runs on.)
So you should not rely too much on the details of my explanation :-)
There is some background about "iowait" here: How does a CPU know there is IO pending? The answer there quotes the counter-intuitive idea that cumulative iowait "may decrease in certain conditions". I wondered if my simple test might be triggering such an undocumented condition?
Yes.
When I first looked this up, I found talk of "hiccups". Also, the problem was illustrated by showing that the cumulative "iowait" time was non-monotonic. That is, it sometimes jumped backwards (decreased). It was not as straightforward as the test above.
However, when they investigated, they found the same fundamental problem. A solution was proposed and prototyped by Peter Zijlstra and Hidetoshi Seto respectively. The problem is explained in the cover message:
[RFC PATCH 0/8] rework iowait accounting (2014-07-07)
I found no evidence of progress beyond this. There was an open question on one of the details. Also, the full series touched specific code for the PowerPC, S390, and IA64 CPU architectures. So I say this is not trivial to fix.
Best Answer
What you look for should be found inside this virtual file:

and the reverse in

From drivers/base/cpu.c we see that the source displayed is the kernel variable cpu_isolated_map:

and cpu_isolated_map is exactly what gets set by kernel/sched/core.c at boot:

But as you observed, someone could have modified the affinity of processes, including daemon-spawned ones, cron, systemd and so on. If that happens, new processes will be spawned inheriting the modified affinity mask, not the one set by isolcpus.

So the above will give you isolcpus as you requested, but that might still not be helpful.

Supposing that you find out that isolcpus has been issued, but has not "taken", this unwanted behaviour could be caused by some process realizing that it is bound only to CPU 0, believing by mistake that it is in monoprocessor mode, and helpfully attempting to "set things right" by resetting the affinity mask. If that is the case, you might try to isolate CPUs 0-5 instead of 1-6, and see whether this happens to work.
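Whatever set or reset the mask, a process's effective affinity can be checked directly from procfs; a minimal sketch, inspecting the shell's own status file:

```shell
# Cpus_allowed / Cpus_allowed_list show the affinity mask the kernel
# will actually use for this process (here: the shell itself).
grep -E 'Cpus_allowed(_list)?:' /proc/self/status
```

Comparing this against the isolcpus= setting on the kernel command line shows whether the boot-time mask survived.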