Shell Pipe Scheduling – Bash While Loop and Reading from Pipe

pipe scheduling shell

I have a Windows command-line program that I'm running in a Bash script in Ubuntu via wine. The Bash script basically looks like this:

wine myprogram.exe | while IFS= read -r line
do
   # Process line (":" is a no-op placeholder so the loop body is not empty)
   :
done

Now, since I've written myprogram.exe I know for a fact that it just spits out data as fast as it can. Can anyone explain to me how the Bash while loop is able to process the data in case my program spits it out faster than the while loop can handle? Is there some sorcery going on behind the scenes where the kernel scheduler will make myprogram.exe sleep if it produces too much data? Anyone? Currently I'm leaning towards it being black magic.

Best Answer

First, the program may do its own output buffering. This is sometimes called “stdio buffering” after the C library component that performs it: functions such as putc, fputs and fprintf, declared in stdio.h. If it does, it will produce output in bursts, typically of a few kilobytes. (Flushing the buffer at each newline character is the default only when the output is a terminal; when the output is a pipe, as here, the buffer is flushed when it fills up, when the program flushes it explicitly, or when the program exits.)
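To see this burst behaviour from a shell, here is a minimal sketch. It assumes GNU coreutils (cut, stdbuf) and Bash; the produce and stamp helpers are made up for illustration. cut stands in for any program that uses C stdio. Note that stdbuf only adjusts programs that keep the C library's default buffering, so it may well have no effect on a Windows program run under wine.

```bash
#!/usr/bin/env bash
# Hypothetical producer: emits one line, pauses, then emits another.
produce() { printf 'first\n'; sleep 3; printf 'second\n'; }

# Hypothetical consumer: prints each line with the whole seconds elapsed
# since the consumer started.
stamp() {
    local start=$SECONDS line
    while IFS= read -r line; do
        printf '%s +%ds\n' "$line" "$((SECONDS - start))"
    done
}

# cut writes to a pipe here, so stdio holds "first" in its buffer until
# cut exits; both lines should reach the consumer together.
buffered=$(produce | cut -c1- | stamp)

# stdbuf -oL asks cut to flush at every newline, so "first" should
# arrive right away.
line_buffered=$(produce | stdbuf -oL cut -c1- | stamp)

printf 'default buffering:\n%s\n' "$buffered"
printf 'line buffering:\n%s\n' "$line_buffered"
```

With default buffering, both lines should show a delay of about three seconds; with line buffering, only the second line should.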

At some point, either the program itself or the underlying library function calls the write system call. This requests that the kernel write the specified data into the pipe. The kernel may decide to write all or part of the data. Since the file is a pipe, the kernel will copy the data into the pipe's buffer area. If the pipe buffer is full, then the write system call blocks until there is room: your program (or more precisely, the thread that called write, in case there are several kernel-level threads) will not resume execution until the call to write returns.
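The blocking behaviour can be observed with a deliberately slow reader. This is a minimal sketch, assuming a Linux system with GNU seq and wc; seq plays the role of the fast producer, and the line count is an arbitrary choice that yields roughly 1.3 MB, far more than a typical pipe buffer holds.

```bash
#!/usr/bin/env bash
# Producer: seq prints 200000 lines as fast as it can.
# Consumer: stays idle for a second before reading anything, so the pipe
# buffer fills up and seq has to wait inside write().
received=$(seq 1 200000 | { sleep 1; wc -l; })

# Nothing is dropped: the producer was paused, not ignored.
echo "lines received: $received"
```

The producer cannot run ahead of the reader by more than the pipe buffer's capacity, which is why no data is lost even though the reader started late.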

(It is possible, but unlikely in this situation, that the program has set the pipe's file descriptor as non-blocking. If this is the case and the kernel determines that it can't copy any data, it will return control to the program immediately: the write system call fails with the error EAGAIN, returning -1 rather than a byte count. A program that makes such non-blocking system calls would typically call select or poll or epoll in a loop to block until one of the file descriptors it's communicating on is ready for input or output.)
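A non-blocking write to a full pipe can be provoked from the shell as well. This sketch assumes GNU dd, whose oflag=nonblock option sets O_NONBLOCK on its standard output; the 4096-byte block size and the temporary file for dd's messages are arbitrary choices for illustration.

```bash
#!/usr/bin/env bash
errfile=$(mktemp)   # scratch file for dd's diagnostics

# dd writes 4 KiB blocks to a non-blocking pipe. The reader sleeps first,
# so the pipe fills up; dd's next write fails instead of waiting, and dd
# gives up. wc then counts how many bytes the pipe had accepted.
accepted=$(dd if=/dev/zero bs=4096 oflag=nonblock 2>"$errfile" | { sleep 1; wc -c; })

echo "bytes accepted before the pipe was full: $accepted"
echo "dd reported:"
cat "$errfile"
rm -f "$errfile"
```

On Linux the byte count is typically 65536, the default pipe capacity, and dd's message reports the EAGAIN failure ("Resource temporarily unavailable" in an English locale).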

The fact that the program is blocked during a system call is not related to a choice of scheduling algorithm. At its core, any scheduler distinguishes between ready threads, which can be given CPU time, and waiting threads, which cannot. The gist of a scheduler is to choose a ready thread and let it run until either the thread makes a system call (which may put the thread into a waiting state) or some asynchronous event occurs (in practice, a processor interrupt). During the processing of a system call, it may be that a thread that was until then blocked becomes ready, for example because that thread was in a write call and the kernel has now been able to deliver the data from that call. A few things can make a ready thread blocked from the outside, for example a signal to pause (SIGSTOP). To decide which thread to schedule next, the scheduler maintains some kind of list of the threads that are ready (in a real-world scheduler it is usually a lot more complicated than a simple list).
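The waiting state is visible from user space. The following is a minimal sketch, assuming Linux with procps ps and GNU coreutils; the named pipe, the use of yes as the fast writer and the timings are all made up for illustration.

```bash
#!/usr/bin/env bash
dir=$(mktemp -d)           # scratch directory for a named pipe
fifo=$dir/pipe
mkfifo "$fifo"

sleep 2 < "$fifo" &        # reader: holds the pipe open but never reads
yes > "$fifo" &            # writer: produces output as fast as it can
writer=$!

sleep 0.5                  # give the writer time to fill the pipe buffer
# Ask ps for the writer's state; a leading S means it is sleeping
# (waiting), a leading R would mean it is ready or running.
state=$(ps -o stat= -p "$writer" | tr -d ' ')
echo "writer state: $state"

wait                       # the reader exits, then the writer gets SIGPIPE
rm -rf "$dir"
```

Although yes would happily use a whole CPU, it should show up as sleeping here, because it is stuck in write waiting for room in the pipe.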
