Shell Script – Head Command Eats Extra Characters

Tags: head, pipe, shell-script, text-processing, utilities

The following shell command was expected to print only odd lines of the input stream:

echo -e "aaa\nbbb\nccc\nddd\n" | (while true; do head -n 1; head -n 1 >/dev/null; done)

But instead it just prints the first line: aaa.

The same doesn't happen when head is used with the -c (--bytes) option:

echo 12345678901234567890 | (while true; do head -c 5; head -c 5 >/dev/null; done)

This command outputs 1234512345 as expected. But this works only with the GNU coreutils implementation of the head utility. The BusyBox implementation still eats extra characters, so the output is just 12345.

I guess this implementation choice is an optimization. You can't know in advance where a line ends, so you don't know how many characters you need to read. The only way not to consume extra characters from the input stream is to read it one byte at a time, but reading a stream one byte at a time can be slow. So I guess head reads the input stream into a large enough buffer and then counts lines in that buffer.
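For comparison, the shell's read builtin takes the byte-at-a-time approach: on a pipe, shells such as bash read one byte per read(2) call precisely so they never over-consume. A minimal sketch of the odd-lines loop built on it (assuming bash):

echo -e "aaa\nbbb\nccc\nddd" | (while IFS= read -r line; do echo "$line"; IFS= read -r skip; done)

This prints aaa and ccc, and exits on its own once read hits end-of-file.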

The same can't be said for the case when the --bytes option is used. In this case you know exactly how many bytes you need to read, so you can read exactly that many and no more. The coreutils implementation takes advantage of this, but the BusyBox one does not; it still reads more bytes than required into a buffer, probably to simplify the implementation.
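You can emulate this exact-read behaviour with dd: with count=1 it issues a single read(2) of at most bs bytes, so it cannot consume more than it was asked for. A sketch (note that on a pipe a read may return fewer bytes than bs, though with this small input everything arrives in one write):

echo 12345678901234567890 | (for i in 1 2; do dd bs=5 count=1 2>/dev/null; dd bs=5 count=1 >/dev/null 2>&1; done; echo)

This prints 1234512345 and, unlike the head version, terminates on its own.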

So, the question: Is it correct for the head utility to consume more characters from the input stream than it was asked? Is there some kind of standard for Unix utilities? And if there is, does it specify this behavior?

PS

You have to press Ctrl+C to stop the commands above: head doesn't fail when it reads at end-of-file, it simply prints nothing, so the loops never terminate. If you don't want to press it, you may use a more complex command:

echo 12345678901234567890 | (while true; do head -c 5; [ "$(head -c 5 | wc -c)" -eq 0 ] && break; done)

which I didn't use for simplicity.

Best Answer

Is it correct for the head utility to consume more characters from the input stream than it was asked?

Yes, it’s allowed (see below).

Is there some kind of standard for Unix utilities?

Yes, POSIX volume 3, Shell & Utilities.

And if there is, does it specify this behavior?

It does, in its introduction:

When a standard utility reads a seekable input file and terminates without an error before it reaches end-of-file, the utility shall ensure that the file offset in the open file description is properly positioned just past the last byte processed by the utility. For files that are not seekable, the state of the file offset in the open file description for that file is unspecified.

head is one of the standard utilities, so a POSIX-conforming implementation has to implement the behaviour described above.
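You can check the required behaviour on a seekable file (this assumes a conforming head, such as GNU coreutils'): two invocations sharing one open file description pick up where the previous one stopped.

$ printf 'aaa\nbbb\nccc\nddd\n' > file
$ { head -n 1; head -n 1; } < file
aaa
bbb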

GNU head does try to leave the file descriptor in the correct position, but it’s impossible to seek on pipes, so in your test it fails to restore the position. You can see this using strace:

$ echo -e "aaa\nbbb\nccc\nddd\n" | strace head -n 1
...
read(0, "aaa\nbbb\nccc\nddd\n\n", 8192) = 17
lseek(0, -13, SEEK_CUR)                 = -1 ESPIPE (Illegal seek)
...

The read returns 17 bytes (all the available input), head processes four of those and then tries to move back 13 bytes, but it can’t. (You can also see here that GNU head uses an 8 KiB buffer.)
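You can also observe the over-read without strace: whatever head's buffered read swallowed is gone for the next reader on the pipe.

$ printf 'aaa\nbbb\nccc\nddd\n' | { head -n 1; cat; }
aaa

If head had consumed only the first line, cat would have printed the remaining three.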

When you tell head to count bytes (which is non-standard), it knows how many bytes to read, so it can (if implemented that way) limit its read accordingly. This is why your head -c 5 test works: GNU head only reads five bytes and therefore doesn’t need to seek to restore the file descriptor’s position.

If you write the input to a file and use that instead, you'll get the behaviour you're after:

$ echo -e "aaa\nbbb\nccc\nddd\n" > file
$ (while true; do head -n 1; head -n 1 >/dev/null; done) < file
aaa
ccc
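
Note that this loop still never exits on its own, since head keeps printing nothing once it reaches end-of-file. A self-terminating variant, using the same wc -c test as the question's PS (a sketch under the same assumptions):

$ (while true; do head -n 1; [ "$(head -n 1 | wc -c)" -eq 0 ] && break; done) < file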