Shell – Pipes: how does data flow in a pipeline?

buffer, pipe, shell, stdout, text processing

I don't understand how the data flows in the pipeline and hope someone could clarify what is going on there.

I thought a pipeline of commands processes files (text, arrays of strings) in a line-by-line manner, provided each command itself works line by line: each line of text passes through the pipeline, and commands don't wait for the previous one to finish processing the whole input.

But it seems it is not so.

Here is a test example. There are some lines of text. I uppercase them and repeat each line twice. I do so with cat text | tr '[:lower:]' '[:upper:]' | sed 'p'.

To follow the process we can run it "interactively" — skip the input filename in cat. Each part of the pipeline runs line by line:

$ cat | tr '[:lower:]' '[:upper:]'
alkjsd
ALKJSD
sdkj
SDKJ
$ cat | sed 'p'
line1
line1
line1
line 2
line 2
line 2

But the complete pipeline does wait for me to finish the input with EOF and only then prints the result:

$ cat | tr '[:lower:]' '[:upper:]' | sed 'p'
I am writing...
keep writing...
now ctrl-D
I AM WRITING...
I AM WRITING...
KEEP WRITING...
KEEP WRITING...
NOW CTRL-D
NOW CTRL-D

Is it supposed to be so? Why isn't it line-by-line?

Best Answer

There's a general buffering rule followed by the C standard I/O library (stdio), which most Unix programs use. If output is going to a terminal, it is flushed at the end of each line; otherwise it is flushed only when the buffer (8K on my Linux/amd64 system; it could be different on yours) is full or the program exits.
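
The terminal-or-not decision behind that rule can be imitated from the shell. This is only a minimal sketch, assuming a POSIX shell: describe_stdout is a hypothetical helper name, and [ -t 1 ] asks roughly the same question that stdio asks about standard output before it picks a buffering mode.

```shell
# Hypothetical helper: report which default stdio policy a C program
# would get for its standard output at this position in a pipeline.
describe_stdout() {
    if [ -t 1 ]; then
        echo "stdout is a terminal: line-buffered"
    else
        echo "stdout is not a terminal: block-buffered"
    fi
}

describe_stdout          # at an interactive prompt, stdout is the terminal
describe_stdout | cat    # here stdout is a pipe, so the second branch runs
```

The helper only reports what the default would be; individual programs (GNU cat, for one) are free to override it.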

If all your utilities were following the general rule, you would see output delayed in all of your examples (cat|sed, cat|tr, and cat|tr|sed). But there's an exception: GNU cat never buffers its output. It either doesn't use stdio or it changes the default stdio buffering policy.

I can be fairly sure you're using GNU cat and not some other unix cat because the others wouldn't behave this way. Traditional unix cat has a -u option to request unbuffered output. GNU cat ignores the -u option because its output is always unbuffered.

So whenever you have a pipe with cat on the left, on a GNU system, the passage of data through the pipe will not be delayed. cat isn't even going line by line; your terminal is doing that. While you're typing input for cat, your terminal is in "canonical" mode: line-based, with editing keys like backspace and ctrl-U offering you the chance to edit the line you have typed before sending it with Enter.

In the cat|tr|sed example, tr is still receiving data from cat as soon as you press Enter, but tr is following the stdio default policy: its output is going to a pipe, so it doesn't flush after each line. It writes to the second pipe when the buffer is full, or when an EOF is received, whichever comes first.
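
The delay can be reproduced without typing anything. A rough sketch, assuming GNU tr and a date that supports +%s (common, but not required by POSIX); slow_lines and stamp are hypothetical helpers, with slow_lines standing in for the keyboard and stamp printing how many seconds had passed when each line arrived.

```shell
# Hypothetical stand-in for typing at the keyboard: one line per second.
slow_lines() {
    for word in one two three; do
        printf '%s\n' "$word"
        sleep 1
    done
}

# Hypothetical helper: prefix each incoming line with the number of
# seconds elapsed since this function started reading.
stamp() {
    start=$(date +%s)
    while IFS= read -r line; do
        printf '+%ss %s\n' "$(( $(date +%s) - $start ))" "$line"
    done
}

# tr writes into a pipe, so it holds its output until the buffer fills
# or its input ends; the lines should all arrive together at the end.
slow_lines | tr '[:lower:]' '[:upper:]' | stamp
```

With a block-buffering tr, all three lines should carry about the same offset (roughly the total run time of slow_lines) instead of being spaced one second apart.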

sed is also following the stdio default policy, but its output is going to a terminal, so it will write each line as soon as it has finished with it. This has an effect on how much you must type before something shows up on the other end of the pipeline: if sed were block-buffering its output, you'd have to type twice as much (to fill tr's output buffer and sed's output buffer).

GNU sed has a -u option, so if you reversed the order and used cat|sed -u|tr you would see the output appear instantly again. (The sed -u option might be available elsewhere, but I don't think it's an ancient unix tradition like cat -u.) As far as I can tell there's no equivalent option for tr.
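
The reversed order can be tried non-interactively as well. A small sketch, assuming GNU sed (for -u); slow_lines is a hypothetical stand-in for typing, emitting one line per second.

```shell
# Hypothetical stand-in for typing at the keyboard: one line per second.
slow_lines() {
    for word in one two three; do
        printf '%s\n' "$word"
        sleep 1
    done
}

# sed -u passes each line on as soon as it has processed it; tr is last
# in the pipeline, so when run at a prompt its stdout is the terminal
# and it flushes at every newline.
slow_lines | sed -u 'p' | tr '[:lower:]' '[:upper:]'
```

Run at a terminal, each pair of uppercased lines should show up about a second apart rather than all at the end.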

There is a utility called stdbuf which lets you alter the buffering mode of any command that uses the stdio defaults. It's a bit fragile since it uses LD_PRELOAD to accomplish something the C library wasn't designed to support, but in this case it seems to work:

cat | stdbuf -o 0 tr '[:lower:]' '[:upper:]' | sed 'p'
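
A variation that may be worth trying is line buffering (-oL) instead of no buffering (-o 0), which is usually enough for line-oriented text. The sketch below assumes GNU coreutils stdbuf and a date that supports +%s; slow_lines, stamp and upper are hypothetical helpers, and the fallback branch exists only so the script still runs where stdbuf is missing.

```shell
# Hypothetical stand-in for typing at the keyboard: one line per second.
slow_lines() {
    for word in one two three; do
        printf '%s\n' "$word"
        sleep 1
    done
}

# Hypothetical helper: prefix each incoming line with the number of
# seconds elapsed since this function started reading.
stamp() {
    start=$(date +%s)
    while IFS= read -r line; do
        printf '+%ss %s\n' "$(( $(date +%s) - $start ))" "$line"
    done
}

# Use stdbuf when it is installed; otherwise fall back to plain tr,
# which gives the same text but delivers it late.
if command -v stdbuf >/dev/null 2>&1; then
    upper() { stdbuf -oL tr '[:lower:]' '[:upper:]'; }
else
    upper() { tr '[:lower:]' '[:upper:]'; }
fi

slow_lines | upper | stamp
```

With stdbuf in place the offsets should climb by about one second per line, showing that tr now hands each line on as soon as it is complete.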

Related Solutions

Utility to buffer an unbounded amount of data in a pipeline

The pv (pipe viewer) utility can do this (with the -B option) and a lot more, including giving you progress reports.

Bash – Make GNU Parallel not delay before executing arguments from STDIN

Because of a bug, GNU Parallel only starts processing after it has read one job for each jobslot. After that it reads one job at a time.

In older versions the output will also be delayed by the number of jobslots. Newer versions only delay output by a single job.

So if you sent one job per second to parallel -j10, it would read 10 jobs before starting them. With older versions you would then have to wait an additional 10 seconds before seeing the output from job 3.

A workaround for the limitation at start is to feed one dummy job per jobslot to parallel:

true >jobqueue; tail -n+0 -f jobqueue | parallel &
seq $(parallel --number-of-threads) | parallel -N0 echo true >> jobqueue
# now add the real jobs to jobqueue

A workaround for the delayed output is to use --linebuffer (but this will mix full lines from different jobs).
