Linux – How does Linux kernel find out which process to wake up during interrupt handling

linux-kernel scheduling

I was reading the book Linux Kernel Development on Chapter Process Scheduling. On Page 61, section Waking Up, the first paragraph reads:

Waking is handled via wake_up(), which wakes up all the tasks waiting on the given wait queue. It (Q1: what does this "it" refer to?) calls try_to_wake_up(), which sets the task’s (Q2: which task? all awoken tasks?) state to TASK_RUNNING, calls enqueue_task() to add the task to the red-black tree, and sets need_resched if the awakened task’s priority is higher than the priority of the current task. The code that causes the event to occur typically calls wake_up() itself. For example, when data arrives from the hard disk, the VFS calls wake_up() on the wait queue that holds the processes waiting for the data.

I am quite confused about the above. Let me just use the example in the above paragraph, i.e., the disk interrupted after reading data, but with a more complete picture. Please correct me if any of the following is wrong or incomplete:

Some user process issues a blocking read, which triggers a system call, and the process enters kernel mode.
The kernel programs the disk controller to fetch the needed data and puts the process to sleep (the process is placed on a wait queue). The kernel then schedules another process to run.
A disk interrupt occurs. The CPU suspends the currently executing process and jumps to the disk interrupt handler.
At some point during interrupt handling, the disk controller transfers the data read from disk into main memory (either under the direction of the CPU or by DMA).
(Not sure, please correct) As the paragraph says, the VFS calls wake_up() on the wait queue that holds the processes waiting for the data.

My specific questions are the following:

Q1 (refer to the quoted paragraph): I assume the "It" in the second sentence refers to the function wake_up(). Why does wake_up() wake up all tasks instead of just the one waiting for this disk data?

Q2 (refer to the quoted paragraph): Does try_to_wake_up() somehow know the specific task whose state needs to be set to TASK_RUNNING? Or does try_to_wake_up() set the state of all awoken tasks to TASK_RUNNING?

Q3: How many wait queues are there for the kernel to manage? If there are more than 2 such wait queues, how does the kernel know which queue to select, such that the process waiting for the disk data is on that wait queue?

Q4: Now say we know which queue the waiting process is on. How does the kernel know which process is waiting for the data from the disk? I can only imagine that some info specific to the process requesting the disk data is passed to the disk controller, like the process's PID, a memory address, or something similar. Then, upon completing the interrupt handling, the disk controller (or the kernel?) uses this info to pinpoint the process on the wait queue.

Please help me complete this picture of process wake_up! Thanks!

Best Answer

Q1: “It” is wake_up. It wakes up all tasks that are waiting for the disk data. If they weren't waiting for that data, they wouldn't be waiting on that queue.

Q2: I'm not sure I understand the question. Each wait queue entry contains a pointer to a task. try_to_wake_up receives a pointer to the one task that it's supposed to wake up. It is called once per task being woken, not once for the whole queue.
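To make the Q1/Q2 relationship concrete, here is a minimal user-space sketch in Python, not kernel code. It models only what the quoted paragraph describes: wake_up() walks one wait queue, and try_to_wake_up() acts on the single task it is handed. The class names, the plain list standing in for the red-black tree, and the "larger number means higher priority" convention are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

TASK_RUNNING = "TASK_RUNNING"
TASK_INTERRUPTIBLE = "TASK_INTERRUPTIBLE"


@dataclass
class Task:
    pid: int
    priority: int                 # toy convention: larger number = higher priority
    state: str = TASK_RUNNING


@dataclass
class Scheduler:
    current: Task                 # the task that is on the CPU right now
    runqueue: list = field(default_factory=list)  # stands in for the red-black tree
    need_resched: bool = False

    def try_to_wake_up(self, task):
        """Wake exactly one task: the one the caller passes in."""
        if task.state == TASK_RUNNING:
            return False          # already runnable, nothing to do
        task.state = TASK_RUNNING
        self.runqueue.append(task)            # the enqueue_task() step from the quote
        if task.priority > self.current.priority:
            self.need_resched = True          # ask for a reschedule soon
        return True


@dataclass
class WaitQueue:
    entries: list = field(default_factory=list)   # each entry refers to one task

    def sleep_on(self, task):
        task.state = TASK_INTERRUPTIBLE
        self.entries.append(task)

    def wake_up(self, sched):
        """Walk this queue only, calling try_to_wake_up() once per entry."""
        woken = [t for t in self.entries if sched.try_to_wake_up(t)]
        self.entries.clear()      # simplification: this toy empties the queue here
        return woken


if __name__ == "__main__":
    sched = Scheduler(current=Task(pid=1, priority=0))
    disk_wq = WaitQueue()
    disk_wq.sleep_on(Task(pid=10, priority=5))
    disk_wq.sleep_on(Task(pid=11, priority=0))
    print([t.pid for t in disk_wq.wake_up(sched)], sched.need_resched)
```

The point of the sketch is that "all tasks" in the book means "all tasks on this one queue"; tasks sleeping on other queues are never touched.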

Q3: There are lots of wait queues. There's one for every event that can happen. The disk driver sets up a wait queue for each request to the disk. For example, when the filesystem driver wants the content of a certain disk block, it asks the disk driver for that block, and the request's wait queue starts out holding the task that made the filesystem request. Other entries may be added to the wait queue if another request for the same block comes in while this one is still outstanding.

When an interrupt happens, the disk driver determines which disk has data available from the information passed by the hardware and looks up the data structure that contains the kernel data for this disk to find which request was to be filled. In this data structure, among others, are the location where the data is to be written and the corresponding wait queue indicating what to do next.
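The lookup described above can be sketched as a toy model in Python. This is an illustration under stated assumptions, not how any real block driver is written: ToyDiskDriver, Request, and the dictionary keyed by block number are hypothetical names and structures, and tasks are represented by plain strings.

```python
class Request:
    """One outstanding read; it carries its own wait queue."""

    def __init__(self, block):
        self.block = block
        self.data = None        # where the completed data ends up
        self.waiters = []       # the request's wait queue (task names in this toy)


class ToyDiskDriver:
    def __init__(self):
        self.pending = {}       # block number -> outstanding Request

    def submit(self, block, task):
        """Ask for `block`; join the existing request if one is still outstanding."""
        req = self.pending.get(block)
        if req is None:
            req = Request(block)
            self.pending[block] = req
        req.waiters.append(task)
        return req

    def on_interrupt(self, block, data):
        """Completion path: the 'hardware' reports which block it finished."""
        req = self.pending.pop(block)       # find the request that was to be filled
        req.data = data                     # store the data where the request says
        woken, req.waiters = req.waiters, []
        return woken                        # only this request's waiters are woken


if __name__ == "__main__":
    driver = ToyDiskDriver()
    driver.submit(7, "task A")
    driver.submit(7, "task B")              # same block, so it joins the same queue
    driver.submit(9, "task C")
    print(driver.on_interrupt(7, b"contents of block 7"))
```

No PID is ever handed to the disk in this model. The interrupt identifies a request, and the request's own wait queue identifies the tasks.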

Q4: The process makes a system call, let's say to read a file. This triggers some code in the filesystem driver which determines that the data needs to be loaded from the disk. That code makes a request to the disk driver and adds the calling process to the request's wait queue. (There are actually more layers than that, but you get the idea.) When the disk read completes, the wait queue event triggers, and the process is thus removed from the request's wait queue. The code triggered by the wait queue event is a function supplied by the filesystem driver, which copies the data to the process's memory and causes the read system call to return.
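The sleeping side of this flow can be imitated in user space with threads. The sketch below is an analogy only: a threading.Condition stands in for the request's wait queue, notify_all() plays the role of wake_up(), and ToyRequest, blocking_read, and demo are hypothetical names. Each reader re-checks whether the data has arrived after every wakeup, which is why waking everyone on the queue is harmless.

```python
import threading


class ToyRequest:
    def __init__(self):
        self.cond = threading.Condition()   # stands in for the request's wait queue
        self.data = None

    def wait_for_data(self):
        """Called by the reader: sleep until the data has arrived."""
        with self.cond:
            while self.data is None:        # re-check the condition after each wakeup
                self.cond.wait()
            return self.data

    def complete(self, data):
        """Called by the completion path (the 'interrupt handler' in this toy)."""
        with self.cond:
            self.data = data
            self.cond.notify_all()          # analogous to wake_up() on the queue


def blocking_read(request, results, name):
    # The 'system call' returns only once the request has completed.
    results[name] = request.wait_for_data()


def demo():
    req = ToyRequest()
    results = {}
    readers = [
        threading.Thread(target=blocking_read, args=(req, results, name))
        for name in ("A", "B")
    ]
    for t in readers:
        t.start()
    req.complete(b"block contents")         # the disk read finishes
    for t in readers:
        t.join()
    return results


if __name__ == "__main__":
    print(demo())
```

Because the condition is checked under the lock before sleeping, a reader that arrives after the request has already completed returns immediately instead of sleeping forever.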

Related Solutions

Linux – How to find out which Linux kernel versions are the most prevalent

[Self answer]

While this is less than satisfying, we essentially went with @sjsam's advice and built a list of kernel versions by looking at the default kernel versions that ship with RedHat Enterprise Linux.

Looking at versions of RHEL that are still in support today (April 2016), this gives us the list:

2.6.18
2.6.32
3.10.0
4.X (just for good measure, I should test on whatever's current when we do the experiments)

I have no way of knowing how well this list represents "what's actually being used in real datacentres", but it's something. If we assume that datacentre engineers don't update their kernel from the default, then it's probably reasonably good.

Linux – WCHAN = 0 for a sleeping task

The WCHAN value seems to be computed. From kernel sources arch/x86/Kconfig:

config SCHED_OMIT_FRAME_POINTER 
   Single-depth WCHAN output
   Calculate simpler /proc/<PID>/wchan values. If this option
   is disabled then wchan values will recurse back to the
   caller function. This provides more accurate wchan values,
   at the expense of slightly more scheduling overhead.

and from Documentation/filesystems/proc.txt:

wchan Present with CONFIG_KALLSYMS=y: it shows the kernel function
      symbol the task is blocked in - or "0" if not blocked.

so I guess sometimes there is a miscomputation of wchan.
