Shell Pipe Scheduling – Bash While Loop and Reading from Pipe

Tags: pipe, scheduling, shell

I have a Windows command-line program that I'm running in a Bash script in Ubuntu via wine. The Bash script basically looks like this:

wine myprogram.exe | while read line
do
   # Process line
done

Now, since I've written myprogram.exe I know for a fact that it just spits out data as fast as it can. Can anyone explain to me how the Bash while loop is able to process the data in case my program spits it out faster than the while loop can handle? Is there some sorcery going on behind the scenes where the kernel scheduler will make myprogram.exe sleep if it produces too much data? Anyone? Currently I'm leaning towards it being black magic.

Best Answer

First, the program may do its own output buffering. This is sometimes called “stdio buffering” after the name of the library component that performs this task in C: the functions like putc, fputs, fprintf, etc., declared in stdio.h. If it does, it will produce output in bursts, typically of a few kilobytes (when the output is a terminal, the default is to flush the buffer at each newline character).

At some point, either the programmer or the underlying library function calls write explicitly. This requests that the kernel write the specified data into the pipe. The kernel may decide to write all or part of the data. Since the file is a pipe, the kernel will copy the data into the pipe's buffer area. If the pipe buffer is full, then the write system call blocks until there is room: your program (or more precisely, the thread that called write, in case there are several kernel-level threads) will not resume execution until the call to write returns.
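A small experiment can make this blocking visible. The sketch below is not taken from the asker's script: it uses seq as a stand-in for a fast producer and a plain shell loop as the slower consumer, and the variable names are arbitrary. seq emits roughly 106 KiB here, which is more than the 64 KiB default pipe capacity on Linux (an assumption about the platform), so its writes have to block at some point, yet no line is lost.

```shell
#!/bin/sh
# Fast producer (seq) feeding a slower shell loop through a pipe.
# The output of seq 1 20000 is larger than a typical pipe buffer
# (assumed: 64 KiB on Linux), so seq must block in write until the
# loop has read enough to make room.
lines_seen=$(
  seq 1 20000 | {
    n=0
    while IFS= read -r line; do
      # "Process" the line; here we only count it.
      n=$((n + 1))
    done
    echo "$n"
  }
)
echo "lines processed: $lines_seen"
```

The loop sees every line the producer wrote, however slowly it runs; the producer is simply held back by the full pipe.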

(It is possible, but unlikely in this situation, that the program has set the pipe's file descriptor to non-blocking mode. In that case, if the kernel determines that it can't copy any data, it returns control to the program immediately: the write system call fails with the error EAGAIN instead of blocking. A program that makes such non-blocking system calls would typically call select or poll or epoll in a loop to block until one of the file descriptors it's communicating on is ready for input or output.)

The fact that the program is blocked during a system call is not related to a choice of scheduling algorithm. At its core, any scheduler distinguishes between ready threads, which can be given CPU time, and waiting threads, which cannot. The gist of a scheduler is to choose a ready thread and let it run until either the thread makes a system call (which may put the thread into a waiting state) or some asynchronous event occurs (in practice, a processor interrupt). During the processing of a system call, it may be that a thread that was until then blocked becomes ready, for example because that thread was in a write call and the kernel has now been able to deliver the data from that call. A few things can make a ready thread blocked from the outside, for example a signal to pause (SIGSTOP). To decide which thread to run next, the scheduler keeps track of the threads that are ready, in what is often called the ready list (in a real-world scheduler it is usually a lot more complicated than a simple list).
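The waiting state can be observed from the shell. The following is a rough sketch, assuming a Linux-like system where mkfifo and yes are available and where either /proc or ps can report a process state; the variable names and the one-second delay are arbitrary choices. yes writes into a FIFO whose reader (sleep) never reads, so once the pipe buffer fills, yes should be blocked in write and reported as sleeping (a state starting with S) rather than running.

```shell
#!/bin/sh
# Sketch: watch a producer go to sleep when nobody drains its pipe.
tmpdir=$(mktemp -d)
mkfifo "$tmpdir/pipe"

# Producer: writes "y" lines as fast as it can.
yes >"$tmpdir/pipe" &
producer=$!

# "Consumer": opens the read end of the FIFO but never reads from it.
sleep 5 <"$tmpdir/pipe" &
consumer=$!

# Give the producer time to fill the pipe buffer and block in write.
sleep 1

# Query the process state: /proc on Linux, ps elsewhere (assumption).
if [ -r "/proc/$producer/status" ]; then
  state=$(awk '/^State:/ { print $2 }' "/proc/$producer/status")
else
  state=$(ps -o stat= -p "$producer" | tr -d ' ')
fi
echo "producer state: $state"

# Clean up the background processes and the FIFO.
kill "$producer" "$consumer" 2>/dev/null || true
wait 2>/dev/null || true
rm -rf "$tmpdir"
```

A state of R here would mean the producer was still runnable; with an undrained pipe it has nothing to do but wait.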

Related Solutions

Segfaulting Program – Piping Output from a Segfaulting Program

Programs typically buffer their output for efficiency. That is, they accumulate output in a memory area (called a buffer), and they actually get the output out only when the buffer is full or at certain key points in the program. When the program ends normally, it flushes the output buffer (i.e. prints out any data that's left in it). When it segfaults, the content of the buffer is lost.

You don't observe this effect when running the program directly in a terminal because the behavior is different when the program's output is connected to a terminal (as opposed to a regular file or a pipe). In a terminal, the default behavior is to flush the buffer at the end of each line. Therefore you'll see every complete line that's produced up to the point when the program segfaults.

You can force the program to run in a terminal and collect its output. The simplest way is to run script. There are a number of annoyances that you'll need to work around:

script adds a header line to the transcript file, which you'll need to remove afterwards.
script doesn't return the status code of the command, so you'll need to save it somewhere if you want to know about the segfault or any other error.
script mixes normal output and error output in the transcript; you'd better save the error output to a separate file.

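export FONT="foo"    # exported so the shell started by script can expand $FONT
script -q -c '
    ttf2afm "$FONT.ttf" 2>"$FONT.ttf2afm-err";
    echo $? >"$FONT.ttf2afm-status"
' "$FONT.ttf2afm-typescript"
# Drop the header line that script adds to the transcript.
tail -n +2 <"$FONT.ttf2afm-typescript" >"$FONT.afm"
rm "$FONT.ttf2afm-typescript"
# Report the saved exit status and error output of ttf2afm.
if [ "$(cat "$FONT.ttf2afm-status")" -ne 0 ]; then
  echo 1>&2 "Warning: ttf2afm failed"
  cat "$FONT.ttf2afm-err" 1>&2
fi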

Bash – Check if a Pipe is Empty and Run a Command

There's no way to peek at the content of a pipe using commonly available shell utilities, nor is there a way to read a character from the pipe then put it back. The only way to know that a pipe has data is to read a byte, and then you have to get that byte to its destination.

So do just that: read one byte; if you detect an end of file, then do what you want to do when the input is empty; if you do read a byte then fork what you want to do when the input is not empty, pipe that byte into it, and pipe the rest of the data.

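# Read one byte and keep it as its octal value (empty at end of file).
first_byte=$(dd bs=1 count=1 2>/dev/null | od -t o1 -A n | tr -dc 0-9)
if [ -z "$first_byte" ]; then
  # stuff to do if the input is empty
  :
else
  {
    # Put the consumed byte back in front of the rest of the data.
    printf "\\$first_byte"
    cat
  } | {
    # stuff to do if the input is not empty
    :
  }
fi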
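As an illustration of the same technique, here is a minimal sketch wrapped in a function. The name check_input and the choice of wc -c as the "non-empty" action are hypothetical and only serve to make the behavior visible.

```shell
#!/bin/sh
# Hypothetical helper: print "empty" if stdin has no data, otherwise
# pass the whole input (including the byte already read) to wc -c.
check_input() {
  first_byte=$(dd bs=1 count=1 2>/dev/null | od -t o1 -A n | tr -dc 0-9)
  if [ -z "$first_byte" ]; then
    echo "empty"
  else
    {
      printf "\\$first_byte"   # re-emit the byte that dd consumed
      cat                      # followed by the rest of the input
    } | wc -c | tr -d ' '
  fi
}

empty_result=$(check_input </dev/null)
data_result=$(printf 'hello\n' | check_input)
echo "no input:   $empty_result"
echo "with input: $data_result bytes"
```

The byte count for the non-empty case covers the full input, which shows that the byte consumed by dd is not lost.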

The ifne utility from Joey Hess's moreutils runs a command if its input is not empty. It usually isn't installed by default, but it should be available or easy to build on most unix variants. If the input is empty, ifne does nothing and returns the status 0, which cannot be distinguished from the command running successfully. If you want to do something if the input is empty, you need to arrange for the command not to return 0, which can be done by having the success case return a distinguishable error status:

ifne sh -c 'do_stuff_with_input && exit 255'
case $? in
  0) echo empty;;
  255) echo success;;
  *) echo failure;;
esac

test -t 0 has nothing to do with this; it tests whether standard input is a terminal. It doesn't say anything one way or the other as to whether any input is available.
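A quick check shows why. In this sketch (the function name describe_stdin is made up for the example), standard input is a pipe in both cases, once with data and once without, and test -t 0 gives the same answer each time.

```shell
#!/bin/sh
# Hypothetical helper: report only whether stdin is a terminal.
describe_stdin() {
  if [ -t 0 ]; then
    echo "terminal"
  else
    echo "not a terminal"
  fi
}

with_data=$(printf 'x\n' | describe_stdin)   # pipe carrying one line
without_data=$(: | describe_stdin)           # pipe carrying nothing
echo "pipe with data:    $with_data"
echo "pipe without data: $without_data"
```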
