Linux – Understanding System ‘iowait’

linux

My basic assumption is that when a process' only limiting factors are disk and CPU, then total system "iowait" + CPU usage should equal at least 100% of one logical CPU. (In other cases this will not hold. E.g. when downloading a file using wget, the network is often the limiting factor).

This assumption is violated by a simple test. Is this expected? If it is expected, is there a set of conditions where I should expect my assumption to hold true?

There is some background about "iowait" here: How does a CPU know there is IO pending? The answer here quotes the counter-intuitive idea, that cumulative iowait "may decrease in certain conditions". I wonder if my simple test can be triggering such an undocumented condition?

UPDATE: Please skip to the answer.

The answer has a simpler test than the one I used originally. I preserved the original question below. The original question might show some additional details.

Original question

In a short test, I use dd to request the kernel to generate random bytes, and write them to a file. I run the dd command inside perf stat, just to get a count of the CPU time spent inside the kernel. I also run it inside perf trace -s, to report the time spent inside write(). At the same time, I run vmstat 5 in another terminal, to see the system "iowait".

I expected I would see at least one whole CPU as "non-idle", i.e. 100% of the time it is either running, or halted but waiting for IO ("iowait" state). It was not.
(Also, I was expecting to see "iowait" time roughly match the time spent in write(). But it did not appear to do so.)

The detailed results and test environment are shown below. Also shown is an alternative test, where my assumption did hold. Note: it was necessary to run perf stat inside perf trace, not the other way around. This is detailed here: Does "perf stat" (and "time" !) show incorrect results when running "perf trace -s" ?

Background information on "iowait"

Following is the definition taken from the sar manpage:

%iowait:

Percentage of time that the CPU or CPUs were idle during which the
system had an outstanding disk I/O request.

Therefore, %iowait means that from the CPU point of view, no tasks
were runnable, but at least one I/O was in progress. iowait is simply
a form of idle time when nothing could be scheduled. The value may or
may not be useful in indicating a performance problem, but it does
tell the user that the system is idle and could have taken more work.

https://support.hpe.com/hpsc/doc/public/display?docId=c02783994

There is also a longer article: Understanding I/O Wait (or why 0% Idle can be OK). This explains how you can see the definition clearly from the kernel code. The code has changed somewhat, but the idea is still clear:

/*
 * Account for idle time.
 * @cputime: the CPU time spent in idle wait
 */
void account_idle_time(u64 cputime)
{
    u64 *cpustat = kcpustat_this_cpu->cpustat;
    struct rq *rq = this_rq();

    if (atomic_read(&rq->nr_iowait) > 0)
        cpustat[CPUTIME_IOWAIT] += cputime;
    else
        cpustat[CPUTIME_IDLE] += cputime;
}

The article also shows a number of related experiments on a single-CPU system. Some of the experiments even use dd with if=/dev/urandom ! However the experiments do not include my test dd if=/dev/urandom of=test.out . It only uses dd if=/dev/urandom of=/dev/null .

"IO wait" is a bit more tricky to think about now because we use multi-CPU systems, but I think I still understand it, based on the quoted code.

Environment

I have four logical CPUs.

I use LVM, and the ext4 filesystem. I am not using any encryption on my disk or filesystem. I do not have any network filesystem mounted at all, so I am not reading or writing a network filesystem.

The results below are from kernel 4.20.15-200.fc29.x86_64 , using the noop IO scheduler. The cfq IO scheduler also gives similar results.

(I have also seen similar results on a kernel build which was based on a similar configuration, but was closer to kernel version 5.1, and using mq-deadline. So that was using the new blk-mq code).

Test and results

$ sudo perf trace -s \
       perf stat \
       dd if=/dev/urandom of=test.out bs=1M oflag=direct count=3000

3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB, 2.9 GiB) copied, 31.397 s, 100 MB/s

 Performance counter stats for 'dd if=/dev/urandom of=test.out bs=1M oflag=direct count=3000':

         18,014.26 msec task-clock                #    0.574 CPUs utilized          
             3,199      context-switches          #    0.178 K/sec                  
                 4      cpu-migrations            #    0.000 K/sec                  
               328      page-faults               #    0.018 K/sec                  
    45,232,163,658      cycles                    #    2.511 GHz                    
    74,538,278,379      instructions              #    1.65  insn per cycle         
     4,372,725,344      branches                  #  242.737 M/sec                  
         4,650,429      branch-misses             #    0.11% of all branches        

      31.398466725 seconds time elapsed

       0.006966000 seconds user
      17.910332000 seconds sys

 Summary of events:
...
 dd (4620), 12156 events, 12.0%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   read                3007 17624.985     0.002     5.861    12.345      0.21%
   write               3003 13722.837     0.004     4.570   179.928      2.63%
   openat                12     0.371     0.002     0.031     0.267     70.36%
...

I read the iowait figure from the wa column of vmstat. You can tell when the test is running by looking at the io column (bo = 1K blocks output).

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 5126892 176512 1486060   0   0  1788  4072  321  414  4  4 83  9  0
 1  0      0 5126632 176520 1485988   0   0     0     7  212  405  0  1 99  0  0
 0  0      0 5126884 176520 1485988   0   0     0     0  130  283  0  0 99  0  0
 0  0      0 5126948 176520 1485908   0   0     0     1  157  325  0  0 99  0  0
 0  0      0 5126412 176520 1486412   0   0   115     0  141  284  0  0 99  0  0
 0  2      0 5115724 176548 1487056   0   0     0  6019 18737 10733  3  6 89  2  0
 1  0      0 5115708 176580 1487104   0   0     3 91840 1276  990  0 13 77  9  0
 1  0      0 5115204 176600 1487128   0   0     2 91382 1382 1014  0 14 81  4  0
 1  0      0 5115268 176636 1487084   0   0     4 88281 1257  901  0 14 83  3  0
 0  1      0 5113504 177028 1487764   0   0    77 92596 1374 1111  0 15 83  2  0
 1  0      0 5114008 177036 1487768   0   0     0 113282 1460 1060  0 16 81  2  0
 1  0      0 5113472 177044 1487792   0   0     0 110821 1489 1118  0 16 74 10  0
 0  0      0 5123852 177068 1487896   0   0     0 20537  631  714  1  3 94  2  0
 0  0      0 5123852 177076 1487856   0   0     0    10  324  529  2  1 98  0  0
 2  0      0 5123852 177084 1487872   0   0     0    70  150  299  0  0 99  0  0

Test results where it does hold (inside a VM)

I tried the same test inside a VM with 1 CPU, which was running the kernel 5.0.9-301.fc30.x86_64 and using mq-deadline (and hence blk-mq). In this test, it worked how I expected it to.

$ sudo perf trace -s \
       perf stat \
       dd if=/dev/urandom of=test.out bs=1M oflag=direct count=3000
[sudo] password for alan-sysop:
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB, 2.9 GiB) copied, 46.8071 s, 67.2 MB/s

 Performance counter stats for 'dd if=/dev/urandom of=test.out bs=1M oflag=direct count=3000':

         18,734.89 msec task-clock                #    0.400 CPUs utilized
            16,690      context-switches          #    0.891 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               328      page-faults               #    0.018 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      46.820355993 seconds time elapsed

       0.011840000 seconds user
      18.531449000 seconds sys


 Summary of events:
...
 dd (1492), 12156 events, 38.4%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   write               3003 28269.070     0.019     9.414  5764.657     22.39%
   read                3007 18371.469     0.013     6.110    14.848      0.53%
   execve                 6    10.399     0.012     1.733    10.328     99.18%
...

Output of vmstat 5:

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----                                                                     
 r  b  swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st                                                                     
 0  0     0 726176  52128 498508    0    0  2040   231  236  731  7  5 77 11  0                                                                     
 0  0     0 726176  52136 498508    0    0     0    10   25   46  0  0 99  1  0                                                                     
 0  0     0 726208  52136 498508    0    0     0     0   29   56  0  0 100  0  0                                                                    
 0  1     0 702280  55944 511780    0    0  2260 13109 4399 9049  3 17 55 25  0                                                                     
 0  1     0 701776  56040 511960    0    0    18 129582 1406 1458 0 73  0 27  0                                                                    
 0  2     0 701524  56156 512168    0    0    22 87060  960  991  0 50  0 50  0                                                                     
 3  1     0 701524  56228 512328    0    0    14 118170 1301 1322 0 68  0 32  0                                                                    
 1  1     0 701272  56260 512392    0    0     6 86426  994  982  0 53  0 46  0                                                                     
 0  2     0 701020  56292 512456    0    0     6 56115  683  660  0 37  0 63  0                                                                     
 3  2     0 700540  56316 512504    0    0     5 33450  446  457  0 26  0 74  0                                                                     
 0  2     0 700860  56332 512536    0    0     3 16998  311  240  0 19  0 81  0                                                                     
 1  2     0 700668  56368 512616    0    0     7 32563  443  428  0 24  0 76  0                                                                     
 1  0     0 700668  56392 512648    0    0     3 20338  245  272  0 12  0 88  0                                                                   
 0  1     0 707096  56408 512920    0    0    54 20913  312  530  0 12 79  8  0                                                                     
 0  0     0 707064  56432 512920    0    0     0    49   39   64  0  0 45 55  0                                                                     
 0  0     0 707064  56432 512920    0    0     0     0   24   46  0  0 100  0  0                                                                    
 0  0     0 707064  56432 512920    0    0     0    80   28   47  0  0 100  0  0

I tried hot-adding a CPU to the VM and testing again. The results were variable: sometimes it showed about 0% in the idle column, and sometimes it showed about 50% idle (i.e. one out of two CPUs). In the case of 0% "idle", "iowait" was very high i.e. more than one CPUs' worth. I.e. my expectation point 2 was not correct. I can begrudgingly accept this apparent limitation of "iowait" on multi-CPU systems. (Though I don't quite understand it. If someone wants to explain it exactly, that would be great). However, "idle" was not above 50% in either case, so these tests were still consistent with my first assumption about "iowait".

I tried shutting the VM down and starting it with 4 CPUs. Similarly, often I had exactly 75% idle, and sometimes I had as low as 50% idle, but I did not see more than 75% idle (i.e. more than three out of four CPUs).

Whereas on the physical system with 4 CPUs, I can still reproduce the result of more than 80% idle as shown above.

Best Answer

Content notice: this post includes links to various Linux discussion and code. Some linked content does not meet the current Code of Conduct for StackExchange or for Linux. Mostly they "insult the code [but not the person]". However some language is used, that should simply not be repeated. I ask you to avoid imitating, parrotting, or debating such language.

Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low

On 05/07/2019 12:38, Peter Zijlstra wrote:

On Fri, Jul 05, 2019 at 12:25:46PM +0100, Alan Jenkins wrote:

My cpu "iowait" time appears to be reported incorrectly. Do you know why this could happen?

Because iowait is a magic random number that has no sane meaning. Personally I'd prefer to just delete the whole thing, except ABI :/

Also see the comment near nr_iowait()

Thanks. I take [the problems mentioned in current documentation] as being different problems, but you mean there is not much demand (or point) to "fix" my issue.

I found my problem. It was already noticed five years ago, and it would not be trivial to fix.

"iowait" time is updated by the function account_idle_time():

/*
 * Account for idle time.
 * @cputime: the CPU time spent in idle wait
 */
void account_idle_time(u64 cputime)
{
    u64 *cpustat = kcpustat_this_cpu->cpustat;
    struct rq *rq = this_rq();

    if (atomic_read(&rq->nr_iowait) > 0)
        cpustat[CPUTIME_IOWAIT] += cputime;
    else
        cpustat[CPUTIME_IDLE] += cputime;
}

This works as I expected, if you are approximating cpu time by "sampling" with the traditional timer interrupt ("tick"). However, it may not work if the tick is turned off during idle time to save power - NO_HZ_IDLE. It may also fail if you allow the tick to be turned off for performance reasons - NO_HZ_FULL - because that requires starting VIRT_CPU_ACCOUNTING. Most Linux kernels use the power-saving feature. Some embedded systems do not use either feature. Here is my explanation:

When the IO is complete, the device sends an interrupt. The kernel interrupt handler wakes the process using try_to_wake_up(). It subtracts one from the nr_iowait counter:

if (p->in_iowait) {
    delayacct_blkio_end(p);
    atomic_dec(&task_rq(p)->nr_iowait);
}

If the process is woken on an idle CPU, that CPU calls account_idle_time(). Depending on which configuration applies, this is called either from tick_nohz_account_idle_ticks() from __tick_nohz_idle_restart_tick(), or from vtime_task_switch() from finish_task_switch().

By this time, ->nr_iowait has already been decremented. If it is reduced to zero, then no iowait time will be recorded.

This effect can vary: it depends which CPU the process is woken on. If the process is woken on the same CPU that received the IO completion interrupt, the idle time could be accounted earlier, before ->nr_iowait is decremented. In my case, I found CPU 0 handles the ahci interrupt, by looking at watch cat /proc/interrupts.

I tested this with a simple sequential read:

dd if=largefile iflag=direct bs=1M of=/dev/null

If I pin the command to CPU 0 using taskset -c 0 ..., I see "correct" values for iowait. If I pin it to a different CPU, I see much lower values. If I run the command normally, it varies depending on scheduler behaviour, which has changed between kernel versions. In recent kernels (4.17, 5.1, 5.2-rc5-ish), the command seems to spend about 1/4 of the time on CPU 0, because "iowait" time is reduced to that fraction.

(Not explained: why running this test on my virtual machine now seems to reproduce "correct" iowait, for each (or any) CPU. I suspect this might involve IRQ_TIME_ACCOUNTING, although this feature is also being used in my tests outside the VM.

I also have not confirmed exactly why suppressing NO_HZ_IDLE gives "correct" iowait for each CPU on 4.17+, but not on 4.16 or 4.15.

Running this test on my virtual machine seems to reproduce "correct" iowait, for each (or any) CPU. This is due to IRQ_TIME_ACCOUNTING. It is also used in the tests outside the VM, but I get more interrupts when testing inside the VM. Specifically, there are more than 1000 "Function call interrupts" per second on the virtual CPU that "dd" runs on.

So you should not rely too much on the details of my explanation :-)

There is some background about "iowait" here: How does a CPU know there is IO pending? The answer here quotes the counter-intuitive idea, that cumulative iowait "may decrease in certain conditions". I wonder if my simple test can be triggering such an undocumented condition?

Yes.

When I first looked this up, I found talk of "hiccups". Also, the problem was illustrated by showing the cumulative "iowait" time was non-monotonic. That is it sometimes jumped backwards (decreased). It was not as straightforward as the test above.

However, when they investigated they found the same fundamental problem. A solution was proposed and prototyped, by Peter Zijlstra and Hidetoshi Seto respectively. The problem is explained in the cover message:

[RFC PATCH 0/8] rework iowait accounting (2014-07-07)

I found no evidence of progress beyond this. There was an open question on one of the details. Also, the full series touched specific code for the PowerPC, S390, and IA64 CPU architectures. So I say this is not trivial to fix.

Related Solutions

CPU – How It Detects Pending IO in Linux

The CPU doesn’t know any of this, the task scheduler does.

The definition you quote is somewhat misleading; the current procfs(5) manpage has a more accurate definition, with caveats:

iowait (since Linux 2.5.41)

(5) Time waiting for I/O to complete. This value is not reliable, for the following reasons:

The CPU will not wait for I/O to complete; iowait is the time that a task is waiting for I/O to complete. When a CPU goes into idle state for outstanding task I/O, another task will be scheduled on this CPU.

On a multi-core CPU, the task waiting for I/O to complete is not running on any CPU, so the iowait of each CPU is difficult to calculate.

The value in this field may decrease in certain conditions.

iowait tries to measure time spent waiting for I/O, in general. It’s not tracked by a specific CPU, nor can it be (point 2 above — which also matches what you’re wondering about). It is measured per CPU though, as far as possible.

The task scheduler “knows” there is pending I/O, because it knows that it suspended a given task because it’s waiting for I/O. This is tracked per task in the in_iowait field of the task_struct; you can look for in_iowait in the scheduler core to see how it is set, tracked and cleared. Brendan Gregg’s recent article on Linux load averages includes useful background information. The iowait entry in /proc/stat, which is what ends up in top, is incremented whenever a timer tick is accounted for, and the current process “on” the CPU is idle; you can see this by looking for account_idle_time in the scheduler’s CPU time-tracking code.

So a more accurate definition would be “time spent on this CPU waiting for I/O, when there was nothing better to do”...

Linux – `umount -R` on bind mounts takes a non-neglible amount of time, why

So you think umount spends time waiting for something (as it spends very little cpu time in either user or sys). Let's find out why it waits...

# perf trace -g -e sched:* umount2 -R /mnt/a

perf record shows us hitting several scheduler tracepoints; it turned out the revealing one is sched:sched_switch.

Samples: 21  of event 'sched:sched_switch', Event count (approx.): 21
  Children      Self  Trace output                                                                                                                   ▒
-  100.00%   100.00%  umount:1888 [120] D ==> swapper/3:0 [120]                                                                                      ▒
     0                                                                                                                                               ▒
     __umount2                                                                                                                                       ▒
     entry_SYSCALL_64_fastpath                                                                                                                       ▒
     sys_umount                                                                                                                                      ▒
     do_umount                                                                                                                                       ▒
     namespace_unlock                                                                                                                                ▒
     synchronize_sched                                                                                                                               ▒
     __wait_rcu_gp                                                                                                                                   ▒
     wait_for_completion                                                                                                                             ▒
     schedule_timeout                                                                                                                                ▒
     schedule                                                                                                                                        ▒
     __schedule                                                                                                                                      ▒
     __schedule

__wait_rcu_gp() refers to an RCU grace period. namespace_unlock() in fs/namespace.c is some form of global synchronization, which includes synchronize_rcu(). It waits until all "currently executing RCU read-side critical sections have completed". "RCU grace periods extend for multiple milliseconds... this situation is a major reason for the rule of thumb that RCU be used in read-mostly situations". I suppose that mount namespaces are considered to be "read-mostly".

It looks this these "few milliseconds" account for the average wait of 5 milliseconds in each of the 34 calls to umount2().