I don't understand how the data flows in the pipeline and hope someone could clarify what is going on there.
I thought a pipeline of commands processes files (text, arrays of strings) in a line-by-line manner, provided each command itself works line by line: each line of text passes through the whole pipeline, and no command waits for the previous one to finish processing its entire input.
But it seems it is not so.
Here is a test example. There are some lines of text. I uppercase them and repeat each line twice. I do so with `cat text | tr '[:lower:]' '[:upper:]' | sed 'p'`.
To follow the process we can run it "interactively" by skipping the input filename in `cat`. Each part of the pipeline runs line by line:
$ cat | tr '[:lower:]' '[:upper:]'
alkjsd
ALKJSD
sdkj
SDKJ
$ cat | sed 'p'
line1
line1
line1
line 2
line 2
line 2
But the complete pipeline does wait for me to finish the input with EOF and only then prints the result:
$ cat | tr '[:lower:]' '[:upper:]' | sed 'p'
I am writing...
keep writing...
now ctrl-D
I AM WRITING...
I AM WRITING...
KEEP WRITING...
KEEP WRITING...
NOW CTRL-D
NOW CTRL-D
Is it supposed to be so? Why isn't it line-by-line?
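The same observation can probably be reproduced without typing anything, which makes it easier to share. The following is only a rough sketch: `slow_input` and `stamp` are hypothetical helper names made up for this illustration (they are not standard commands), and the timing described assumes a `tr` that writes through the C library's default output buffering, as GNU `tr` appears to do.

```shell
#!/bin/sh
# Hypothetical helper: emit two lines two seconds apart, imitating slow typing.
slow_input() {
    printf 'first\n'
    sleep 2
    printf 'second\n'
}

# Hypothetical helper: prefix every line read from stdin with the number of
# whole seconds that have passed since this function started reading.
stamp() {
    start=$(date +%s)
    while IFS= read -r line; do
        now=$(date +%s)
        printf '%s %s\n' "$((now - start))" "$line"
    done
}

echo 'without tr:'
slow_input | stamp

echo 'through tr:'
slow_input | tr '[:lower:]' '[:upper:]' | stamp
```

In the first run the two lines should carry clearly different timestamps, because each one reaches `stamp` as soon as it is written. In the second run, on a system where `tr` behaves as assumed, both lines tend to show up together at the end with roughly the same timestamp, which matches what the interactive experiment suggests.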
Best Answer
There's a general buffering rule followed by the C standard I/O library (`stdio`) that most unix programs use. If output is going to a terminal, it is flushed at the end of each line; otherwise it is flushed only when the buffer (8K on my Linux/amd64 system; could be different on yours) is full.

If all your utilities were following the general rule, you would see output delayed in all of your examples (`cat|sed`, `cat|tr`, and `cat|tr|sed`). But there's an exception: GNU `cat` never buffers its output. It either doesn't use `stdio` or it changes the default `stdio` buffering policy.

I can be fairly sure you're using GNU `cat` and not some other unix `cat` because the others wouldn't behave this way. Traditional unix `cat` has a `-u` option to request unbuffered output. GNU `cat` ignores the `-u` option because its output is always unbuffered.

So whenever you have a pipe with a `cat` on the left, in the GNU system, the passage of data through the pipe will not be delayed. The `cat` isn't even going line by line - your terminal is doing that. While you're typing input for `cat`, your terminal is in "canonical" mode - line-based, with editing keys like backspace and ctrl-U offering you the chance to edit the line you have typed before sending it with Enter.

In the `cat|tr|sed` example, `tr` is still receiving data from `cat` as soon as you press Enter, but `tr` is following the `stdio` default policy: its output is going to a pipe, so it doesn't flush after each line. It writes to the second pipe when the buffer is full, or when an EOF is received, whichever comes first.

`sed` is also following the `stdio` default policy, but its output is going to a terminal so it will write each line as soon as it has finished with it. This has an effect on how much you must type before something shows up on the other end of the pipeline - if `sed` was block-buffering its output, you'd have to type twice as much (to fill `tr`'s output buffer and `sed`'s output buffer).

GNU `sed` has `-u` option so if you reversed the order and used `cat|sed -u|tr` you would see the output appear instantly again. (The `sed -u` option might be available elsewhere but I don't think it's an ancient unix tradition like `cat -u`) As far as I can tell there's no equivalent option for `tr`.

There is a utility called `stdbuf` which lets you alter the buffering mode of any command that uses the `stdio` defaults. It's a bit fragile since it uses `LD_PRELOAD` to accomplish something the C library wasn't designed to support, but in this case it seems to work: