Why does ‘sed q’ work differently when reading from a pipe?


I created a test file named 'test' that contains the following:

xxx
yyy
zzz

I ran the command:

(sed '/y/ q'; echo aaa; cat) < test

and I got:

xxx
yyy
aaa
zzz

Then I ran:

cat test | (sed '/y/ q'; echo aaa; cat)

and got:

xxx
yyy
aaa

Question

sed reads and prints until it encounters a line with 'y', then stops. In the first case, but not the second, cat reads and prints the rest.

Can someone explain what phenomenon is behind this difference in behavior?

I also noticed it works this way on Ubuntu 16.04 and CentOS 6, but on CentOS 7 neither command prints 'zzz'.

Best Answer

sed (and other standard utilities) behaves differently depending on whether its input file is seekable (such as a regular file) or not seekable (such as a pipe). See the INPUT FILES section of the POSIX documentation on utility description defaults.

Quote from the doc:

When a standard utility reads a seekable input file and terminates without an error before it reaches end-of-file, the utility shall ensure that the file offset in the open file description is properly positioned just past the last byte processed by the utility.

So in:

(sed '/y/ q'; echo aaa; cat) < test

sed executed the q command before reaching EOF, so it left the file offset at the beginning of the zzz line, and cat could continue printing the remaining lines. (GNU sed is not POSIX-compliant under some conditions; see below.)
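The shared offset can be observed with a tool whose read size is fully controlled. The following is a minimal sketch, assuming a scratch file under /tmp (the path and variable names are illustrative and not part of the question): dd is asked for a single 4-byte block, so it consumes exactly the xxx line, and cat, which inherits the same open file description, continues from there.

```shell
# Scratch copy of the test file; the path is an assumption for this sketch.
f=/tmp/offset_demo.$$
printf 'xxx\nyyy\nzzz\n' > "$f"

# dd issues one 4-byte read ("xxx\n") and exits. The offset belongs to the
# open file description shared by every command in the group, so cat
# starts at byte 4 rather than at the beginning of the file.
rest=$( { dd bs=4 count=1 >/dev/null 2>&1; cat; } < "$f" )

printf '%s\n' "$rest"   # prints yyy, then zzz
rm -f "$f"
```

This is the same mechanism that lets cat print zzz after a sed that leaves the offset just past the yyy line.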

And continuing from the doc:

For files that are not seekable, the state of the file offset in the open file description for that file is unspecified

In this case the behavior is unspecified. Most standard tools, including sed, consume as much input as they can. sed read past the yyy line and quit without restoring the file offset (which cannot be done on a pipe), so nothing was left for cat.
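To check which of the two cases applies to a given invocation, one rough approach is to test what kind of file standard input is. This is only a sketch: stdin_kind is a hypothetical helper, it assumes the system provides /dev/stdin (Linux and macOS do), and it assumes the shell builds pipelines from real pipes, as bash and dash do.

```shell
# Hypothetical helper: report what kind of file standard input is.
# Assumes /dev/stdin exists and refers to file descriptor 0.
stdin_kind() {
    if [ -p /dev/stdin ]; then
        echo "pipe"            # not seekable
    elif [ -f /dev/stdin ]; then
        echo "regular file"    # seekable
    else
        echo "other"           # terminal, socket, device, ...
    fi
}

# Scratch file; the path is an assumption for this sketch.
f=/tmp/kind_demo.$$
printf 'xxx\n' > "$f"

from_file=$(stdin_kind < "$f")
from_pipe=$(cat "$f" | stdin_kind)

printf '%s\n%s\n' "$from_file" "$from_pipe"
rm -f "$f"
```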


GNU sed does not always comply with the standard; its behavior depends on the system's stdio implementation and glibc version:

$ (gsed '/y/ q'; echo aaa; cat) < test
xxx
yyy
aaa

This result was obtained on Mac OS X 10.11.6 and on virtual machines running CentOS 7.2 (glibc 2.17) and Ubuntu 14.04 (glibc 2.19), hosted on OpenStack with a Ceph backend.

On those systems, you can use the -u option to get the standard behavior:

(gsed -u '/y/ q'; echo aaa; cat) </tmp/test

and for a pipe:

$ cat test | (gsed -u '/y/ q'; echo aaa; cat)
xxx
yyy
aaa
zzz

This makes sed terribly inefficient, because it has to read one byte at a time. Partial output from strace:

$ strace -fe read sh -c '{ sed -u "/y/q"; echo aaa; cat; } <test'
...
[pid  5248] read(3, "", 4096)           = 0
[pid  5248] read(0, "x", 1)             = 1
[pid  5248] read(0, "x", 1)             = 1
[pid  5248] read(0, "x", 1)             = 1
[pid  5248] read(0, "\n", 1)            = 1
xxx
[pid  5248] read(0, "y", 1)             = 1
[pid  5248] read(0, "y", 1)             = 1
[pid  5248] read(0, "y", 1)             = 1
[pid  5248] read(0, "\n", 1)            = 1
yyy
...
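The one-byte reads above are the price of never consuming more than was asked for. The shell's read builtin accepts the same trade-off, since it must not consume input beyond the line it returns. A minimal sketch of a pure-shell stand-in for sed '/y/ q' follows; print_until is a hypothetical helper, it matches a shell glob pattern rather than a regular expression, and the scratch path is an assumption.

```shell
# Hypothetical helper: print lines up to and including the first one that
# matches a glob pattern, without consuming anything past that line.
print_until() {
    pattern=$1
    while IFS= read -r line; do
        printf '%s\n' "$line"
        case $line in
            $pattern) return 0 ;;   # stop right after the matching line
        esac
    done
}

# Scratch copy of the test file; the path is an assumption for this sketch.
f=/tmp/until_demo.$$
printf 'xxx\nyyy\nzzz\n' > "$f"

from_file=$( { print_until '*y*'; echo aaa; cat; } < "$f" )
from_pipe=$( cat "$f" | { print_until '*y*'; echo aaa; cat; } )

printf '%s\n' "$from_file"
printf '%s\n' "$from_pipe"
rm -f "$f"
```

Both invocations print xxx, yyy, aaa, zzz, because the final cat always finds the rest of the input where the loop stopped. Like sed -u, this is slow on large inputs.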