Piped commands run concurrently. When you run ps | grep …, it's the luck of the draw (or a matter of details of the workings of the shell combined with scheduler fine-tuning deep in the bowels of the kernel) as to whether ps or grep starts first, and in any case they continue to execute concurrently.
This is very commonly used to allow the second program to process data as it comes out of the first program, before the first program has completed its operation. For example,
grep pattern very-large-file | tr a-z A-Z
begins to display the matching lines in uppercase even before grep has finished traversing the large file.
grep pattern very-large-file | head -n 1
displays the first matching line, and may stop processing well before grep has finished reading its input file.
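You can watch that early termination directly: head exits after printing one line, the read side of the pipe closes, and the producer is killed by SIGPIPE instead of running to completion.

```shell
# head exits after the first line; seq is then killed by SIGPIPE on a
# subsequent write, long before it counts anywhere near 100000000
seq 1 100000000 | head -n 1   # prints: 1
```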
If you read somewhere that piped programs run in sequence, flee this document. Piped programs run concurrently and always have.
The easiest way would be to pipe through some program which sets non-blocking output. Here is a simple perl one-liner (which you can save as leakybuffer) which does so:
so your a | b becomes:
a | perl -MFcntl -e \
'fcntl STDOUT,F_SETFL,O_NONBLOCK; while (<STDIN>) { print }' | b
What it does is read the input and write it to the output (same as cat(1)), but the output is non-blocking - meaning that if a write fails, it returns an error and loses data, but the process continues with the next line of input, as we conveniently ignore the error. The process is kind-of line-buffered as you wanted, but see the caveat below.
You can test it with, for example:
seq 1 500000 | perl -w -MFcntl -e \
'fcntl STDOUT,F_SETFL,O_NONBLOCK; while (<STDIN>) { print }' | \
while read a; do echo $a; done > output
You will get an output file with lost lines (the exact output depends on the speed of your shell etc.), like this:
12768
12769
12770
12771
12772
12773
127775610
75611
75612
75613
You see where the shell lost lines after 12773, but also an anomaly - perl didn't have enough buffer space for 12774\n but did for 1277, so it wrote just that -- and so the next number 75610 does not start at the beginning of the line, making it a little ugly.
That could be improved upon by having perl detect when a write did not succeed completely, and then later try to flush the remainder of the line while ignoring new lines coming in, but that would complicate the perl script much more, so it is left as an exercise for the interested reader :)
Update (for binary files):
If you are not processing newline-terminated lines (like log files or similar), you need to change the command slightly, or perl will consume large amounts of memory (depending on how often newline characters appear in your input):
perl -w -MFcntl -e 'fcntl STDOUT,F_SETFL,O_NONBLOCK; while (read STDIN, $_, 4096) { print }'
It will work correctly for binary files too (without consuming extra memory).
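A quick sanity check (a sketch; the /tmp path is arbitrary) that the read-based variant passes arbitrary bytes through unchanged when nothing actually blocks:

```shell
# feed 64 KiB of random bytes through the binary-safe variant and compare
head -c 65536 /dev/urandom > /tmp/blob
perl -w -MFcntl -e 'fcntl STDOUT,F_SETFL,O_NONBLOCK; while (read STDIN, $_, 4096) { print }' \
    < /tmp/blob | cmp - /tmp/blob && echo identical   # prints: identical
```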
Update2 - nicer text file output:
Avoiding output buffers (syswrite instead of print):
seq 1 500000 | perl -w -MFcntl -e \
'fcntl STDOUT,F_SETFL,O_NONBLOCK; while (<STDIN>) { syswrite STDOUT,$_ }' | \
while read a; do echo $a; done > output
This seems to fix the problems with "merged lines" for me:
12766
12767
12768
16384
16385
16386
(Note: one can verify on which lines the output was cut with the perl -ne '$c++; next if $c==$_; print "$c $_"; $c=$_' output one-liner.)
Best Answer
The only thing about your question that stands out as wrong is that you say
In fact, both programs would be started at pretty much the same time. If there's no input for B when it tries to read, it will block until there is input to read. Likewise, if there's nobody reading the output from A, its writes will block until its output is read (some of it will be buffered by the pipe).

The only thing synchronising the processes that take part in a pipeline is the I/O, i.e. the reading and writing across the pipe. If no writing or reading happens, then the two processes will run totally independently of each other. If one ignores the reading or writing of the other, the ignored process will block and eventually be killed by a SIGPIPE signal (if writing) or get an end-of-file condition on its standard input stream (if reading) when the other process terminates.

The idiomatic way to describe A | B is that it's a pipeline containing two programs. The output produced on standard output from the first program is available to be read on the standard input by the second ("[the output of] A is piped into [the input of] B"). The shell does the required plumbing to allow this to happen.

If you want to use the words "consumer" and "producer", I suppose that's ok too.
The fact that these are programs written in C is not relevant. The fact that this is Linux, macOS, OpenBSD or AIX is not relevant.