CPU – How It Detects Pending IO in Linux

Tags: cpu, linux, load-average, top

I have been looking into the iowait value shown in the output of the top utility:

top - 07:30:58 up  3:37,   1 user,  load average: 0.00, 0.01, 0.05
Tasks:  86 total,   1 running,   85 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

iowait is generally defined as follows:

"It is the time during which CPU is idle and there is some IO pending."

It is my understanding that a process runs on a single CPU at a time. After it gets de-scheduled, either because it used up its time slice or because it blocked, it can eventually be scheduled again on any CPU.

In the case of an IO request, is the CPU that puts a process into uninterruptible sleep responsible for tracking the iowait time, while the other CPUs report that same time as idle on their end, since they really are idle? Is this assumption correct?

Furthermore, assuming there is a long IO request (meaning the process had several opportunities to be scheduled, but wasn't because the IO hadn't completed), how does a CPU know there is "pending IO"? Where is that information fetched from? How can a CPU find out that some process was put to sleep waiting for IO to complete, given that any of the CPUs could have put that process to sleep? How is this "pending IO" status confirmed?

Best Answer

The CPU doesn’t know any of this; the task scheduler does.

The definition you quote is somewhat misleading; the current procfs(5) manpage has a more accurate definition, with caveats:

iowait (since Linux 2.5.41)

(5) Time waiting for I/O to complete. This value is not reliable, for the following reasons:

  1. The CPU will not wait for I/O to complete; iowait is the time that a task is waiting for I/O to complete. When a CPU goes into idle state for outstanding task I/O, another task will be scheduled on this CPU.

  2. On a multi-core CPU, the task waiting for I/O to complete is not running on any CPU, so the iowait of each CPU is difficult to calculate.

  3. The value in this field may decrease in certain conditions.

iowait tries to measure time spent waiting for I/O in general. It’s not tracked by a specific CPU, nor can it be (see point 2 above, which matches what you’re wondering about). It is, however, measured per CPU, as far as that is possible.

The task scheduler “knows” there is pending I/O, because it knows that it suspended a given task because it’s waiting for I/O. This is tracked per task in the in_iowait field of the task_struct; you can look for in_iowait in the scheduler core to see how it is set, tracked and cleared. Brendan Gregg’s recent article on Linux load averages includes useful background information. The iowait entry in /proc/stat, which is what ends up in top, is incremented whenever a timer tick is accounted for, and the current process “on” the CPU is idle; you can see this by looking for account_idle_time in the scheduler’s CPU time-tracking code.

So a more accurate definition would be “time spent on this CPU waiting for I/O, when there was nothing better to do”...
