Shell Script – Head Command Eats Extra Characters

Tags: head, pipe, shell-script, text-processing, utilities

The following shell command was expected to print only odd lines of the input stream:

echo -e "aaa\nbbb\nccc\nddd\n" | (while true; do head -n 1; head -n 1 >/dev/null; done)

But instead it just prints the first line: aaa.

The same doesn't happen when head is used with the -c (--bytes) option:

echo 12345678901234567890 | (while true; do head -c 5; head -c 5 >/dev/null; done)

This command outputs 1234512345 as expected. But this works only with the GNU coreutils implementation of the head utility. The BusyBox implementation still eats extra characters, so the output is just 12345.

I guess this implementation choice is an optimization. You can't know in advance where a line ends, so you don't know how many characters you need to read. The only way not to consume extra characters from the input stream is to read it one byte at a time, but reading a stream one byte at a time can be slow. So I guess head reads the input stream into a large enough buffer and then counts lines in that buffer.
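For comparison, the shell's read builtin takes the byte-at-a-time approach: on a pipe, shells such as bash read one byte per read(2) call precisely so they never over-consume. A minimal sketch of the odd-lines loop built on it (assuming bash):

echo -e "aaa\nbbb\nccc\nddd" | (while IFS= read -r line; do echo "$line"; IFS= read -r skip; done)

This prints aaa and ccc, and exits on its own once read hits end-of-file.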

The same can't be said for the case when the --bytes option is used. In this case you know exactly how many bytes you need to read, so you can read exactly that many and no more. The coreutils implementation takes advantage of this, but the BusyBox one does not; it still reads more bytes than required into a buffer, probably to simplify the implementation.
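You can emulate this exact-read behaviour with dd: with count=1 it issues a single read(2) of at most bs bytes, so it cannot consume more than it was asked for. A sketch (note that on a pipe a read may return fewer bytes than bs, though with this small input everything arrives in one write):

echo 12345678901234567890 | (for i in 1 2; do dd bs=5 count=1 2>/dev/null; dd bs=5 count=1 >/dev/null 2>&1; done; echo)

This prints 1234512345 and, unlike the head version, terminates on its own.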

So, the question: Is it correct for the head utility to consume more characters from the input stream than it was asked? Is there some kind of standard for Unix utilities? And if there is, does it specify this behavior?

PS

You have to press Ctrl+C to stop the commands above: head doesn't fail when it reads at end-of-file, it simply prints nothing, so the loops never terminate. If you don't want to press it, you may use a more complex command:

echo 12345678901234567890 | (while true; do head -c 5; [ "$(head -c 5 | wc -c)" -eq 0 ] && break; done)

which I didn't use for simplicity.

Best Answer

Is it correct for the head utility to consume more characters from the input stream than it was asked?

Yes, it’s allowed (see below).

Is there some kind of standard for Unix utilities?

Yes, POSIX volume 3, Shell & Utilities.

And if there is, does it specify this behavior?

It does, in its introduction:

When a standard utility reads a seekable input file and terminates without an error before it reaches end-of-file, the utility shall ensure that the file offset in the open file description is properly positioned just past the last byte processed by the utility. For files that are not seekable, the state of the file offset in the open file description for that file is unspecified.

head is one of the standard utilities, so a POSIX-conforming implementation has to implement the behaviour described above.
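You can check the required behaviour on a seekable file (this assumes a conforming head, such as GNU coreutils'): two invocations sharing one open file description pick up where the previous one stopped.

$ printf 'aaa\nbbb\nccc\nddd\n' > file
$ { head -n 1; head -n 1; } < file
aaa
bbb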

GNU head does try to leave the file descriptor in the correct position, but it’s impossible to seek on pipes, so in your test it fails to restore the position. You can see this using strace:

$ echo -e "aaa\nbbb\nccc\nddd\n" | strace head -n 1
...
read(0, "aaa\nbbb\nccc\nddd\n\n", 8192) = 17
lseek(0, -13, SEEK_CUR)                 = -1 ESPIPE (Illegal seek)
...

The read returns 17 bytes (all the available input), head processes four of those and then tries to move back 13 bytes, but it can’t. (You can also see here that GNU head uses an 8 KiB buffer.)
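You can also observe the over-read without strace: whatever head's buffered read swallowed is gone for the next reader on the pipe.

$ printf 'aaa\nbbb\nccc\nddd\n' | { head -n 1; cat; }
aaa

If head had consumed only the first line, cat would have printed the remaining three.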

When you tell head to count bytes (which is non-standard), it knows how many bytes to read, so it can (if implemented that way) limit its read accordingly. This is why your head -c 5 test works: GNU head only reads five bytes and therefore doesn’t need to seek to restore the file descriptor’s position.

If you write the input to a file and use that instead, you'll get the behaviour you're after:

$ echo -e "aaa\nbbb\nccc\nddd\n" > file
$ (while true; do head -n 1; head -n 1 >/dev/null; done) < file
aaa
ccc
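
Note that this loop still never exits on its own, since head keeps printing nothing once it reaches end-of-file. A self-terminating variant, using the same wc -c test as the question's PS (a sketch under the same assumptions):

$ (while true; do head -n 1; [ "$(head -n 1 | wc -c)" -eq 0 ] && break; done) < file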