Why does ‘sed q’ work differently when reading from a pipe


I created a test file named 'test' that contains the following:

xxx
yyy
zzz

I ran the command:

(sed '/y/ q'; echo aaa; cat) < test

and I got:

xxx
yyy
aaa
zzz

Then I ran:

cat test | (sed '/y/ q'; echo aaa; cat)

and got:

xxx
yyy
aaa

Question

sed reads and prints until it encounters a line with 'y', then stops. In the first case, but not the second, cat reads and prints the rest.

Can someone explain what phenomenon is behind this difference in behavior?

I also noticed that it works this way on Ubuntu 16.04 and CentOS 6, but on CentOS 7 neither command prints 'zzz'.

Best Answer

sed (and other standard utilities) behave differently depending on whether the input file is seekable (such as a regular file) or not seekable (such as a pipe). See the INPUT FILES section of the POSIX Utility Description Defaults.

Quote from the doc:

When a standard utility reads a seekable input file and terminates without an error before it reaches end-of-file, the utility shall ensure that the file offset in the open file description is properly positioned just past the last byte processed by the utility.

So in:

(sed '/y/ q'; echo aaa; cat) < test

sed executed the q command before reaching EOF, so it left the file offset at the beginning of the zzz line, and cat can continue printing the remaining lines (GNU sed is not POSIX-compliant under some conditions; see below).

And continuing from the doc:

For files that are not seekable, the state of the file offset in the open file description for that file is unspecified

In this case, the behavior is unspecified. Most standard tools, including sed, consume as much input as they can. sed read past the yyy line and quit without restoring the file offset, so nothing is left for cat.
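To see that the outcome depends only on how many bytes the first command takes from the shared input, here is a minimal sketch that swaps sed for dd, whose read size is explicit. The scratch file and variable names are assumptions for illustration; dd with bs=1 is a stand-in for a utility that never reads ahead, not what sed does internally.

```shell
# Hypothetical scratch file with the same three lines as in the question.
tmp=$(mktemp)
printf 'xxx\nyyy\nzzz\n' > "$tmp"

# dd with bs=1 count=8 issues eight 1-byte reads, consuming exactly
# "xxx\nyyy\n" and nothing more, whether stdin is a file or a pipe.
from_file=$( { dd bs=1 count=8 2>/dev/null; echo aaa; cat; } < "$tmp" )
from_pipe=$( cat "$tmp" | { dd bs=1 count=8 2>/dev/null; echo aaa; cat; } )

printf 'file:\n%s\npipe:\n%s\n' "$from_file" "$from_pipe"
rm -f "$tmp"
```

Because dd never reads past the eighth byte, cat finds the zzz line in both cases. A utility that reads a whole buffer from a pipe has no way to give the surplus back.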

GNU sed does not always comply with the standard; its behavior depends on the system's stdio implementation and glibc version:

$ (gsed '/y/ q'; echo aaa; cat) < test
xxx
yyy
aaa

This result was obtained on Mac OS X 10.11.6 and on virtual machines running CentOS 7.2 (glibc 2.17) and Ubuntu 14.04 (glibc 2.19), hosted on OpenStack with a Ceph backend.

On those systems, you can use the -u option to get the standard behavior:

(gsed -u '/y/ q'; echo aaa; cat) </tmp/test

and for a pipe:

$ cat test | (gsed -u '/y/ q'; echo aaa; cat)
xxx
yyy
aaa
zzz

This is terribly inefficient, however, because sed has to read one byte at a time. A partial output from strace:

$ strace -fe read sh -c '{ sed -u "/y/q"; echo aaa; cat; } <test'
...
[pid  5248] read(3, "", 4096)           = 0
[pid  5248] read(0, "x", 1)             = 1
[pid  5248] read(0, "x", 1)             = 1
[pid  5248] read(0, "x", 1)             = 1
[pid  5248] read(0, "\n", 1)            = 1
xxx
[pid  5248] read(0, "y", 1)             = 1
[pid  5248] read(0, "y", 1)             = 1
[pid  5248] read(0, "y", 1)             = 1
[pid  5248] read(0, "\n", 1)            = 1
yyy
...
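If sed -u is not available, a shell loop built on the read builtin is one possible substitute. This is only a sketch: print_until_match is a hypothetical helper name, and it assumes a shell such as bash or dash whose read does not consume input past the newline when stdin is a pipe. Like sed -u, it pays for that with very small reads.

```shell
# print_until_match is a hypothetical helper, not part of sed or of the answer.
# It prints lines up to and including the first one containing the string $1.
print_until_match() {
    while IFS= read -r line; do
        printf '%s\n' "$line"
        case $line in
            *"$1"*) break ;;   # quoted, so $1 is matched literally
        esac
    done
}

out=$(printf 'xxx\nyyy\nzzz\n' | { print_until_match y; echo aaa; cat; })
printf '%s\n' "$out"
```

The loop stops after the yyy line and leaves the zzz line in the pipe for cat.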

Related Solutions

Linux sed Behavior – Why sed Acts Differently Depending on Output File

Why don't you just write

sed -i -e 's/a/a/g' messages.txt

The -i means "in place".

Shell Pipe Scheduling – Bash While Loop and Reading from Pipe

First, the program may do its own output buffering. This is sometimes called “stdio buffering” after the name of the library component that performs this task in C: the functions like putc, fputs, fprintf, etc., declared in stdio.h. If it does, it will produce output in bursts, typically of a few kilobytes (when the output is a terminal, the default is to flush the buffer at each newline character).

At some point, either the programmer or the underlying library function calls write explicitly. This requests that the kernel write the specified data into the pipe. The kernel may decide to write all or part of the data. Since the file is a pipe, the kernel will copy the data into the pipe's buffer area. If the pipe buffer is full, then the write system call blocks until there is room: your program (or more precisely, the thread that called write, in case there are several kernel-level threads) will not resume execution until the call to write returns.

(It is possible, but unlikely in this situation, that the program has set the pipe's file descriptor as non-blocking. If this is the case and the kernel determines that it can't copy any data, it returns control to the program immediately: the write system call fails with the error EAGAIN. A program that makes such non-blocking system calls would typically call select or poll or epoll in a loop to block until one of the file descriptors it's communicating on is ready for input or output.)
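The blocking case described above can be observed with a rough sketch like the following. It assumes the pipe's capacity is well under 200 KiB (the Linux default is 64 KiB); the marker file and variable names are made up for illustration.

```shell
dir=$(mktemp -d)
marker=$dir/writer-done   # hypothetical flag file created when the writer is done

result=$(
    {
        # Writer: push 200 KiB into the pipe, then record that it finished.
        dd if=/dev/zero bs=1024 count=200 2>/dev/null
        : > "$marker"
    } | {
        # Reader: stay idle for a second before draining the pipe.
        sleep 1
        if [ -e "$marker" ]; then
            echo "writer finished"
        else
            echo "writer still blocked"
        fi
        cat > /dev/null
    }
)
printf '%s\n' "$result"
rm -rf "$dir"
```

While the reader sleeps, the writer fills the pipe buffer and then sits in write, so the marker file does not exist yet when the reader checks for it.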

The fact that the program is blocked during a system call is not related to a choice of scheduling algorithm. At its core, any scheduler distinguishes between ready threads, which can be given CPU time, and waiting threads, which cannot. The gist of a scheduler is to choose a ready thread, and let it run until either the thread makes a system call (which puts the thread into a waiting state) or some asynchronous event occurs (in practice, a processor interrupt). During the processing of a system call, it may be that a thread that was until then blocked becomes ready, for example because that thread was in a write call and the kernel has now been able to deliver the data from that call. A few things can make a ready thread blocked from the outside, for example a signal to pause (SIGSTOP). The scheduler maintains some kind of ready list to decide which thread to schedule next: a list of threads that are ready (it is usually a lot more complicated than a simple list in a real-world scheduler).
