The following shell command was expected to print only odd lines of the input stream:
echo -e "aaa\nbbb\nccc\nddd\n" | (while true; do head -n 1; head -n 1 >/dev/null; done)
But instead it prints only the first line: aaa.
The same doesn't happen when it is used with the -c (--bytes) option:
echo 12345678901234567890 | (while true; do head -c 5; head -c 5 >/dev/null; done)
This command outputs 1234512345 as expected. But this works only in the coreutils implementation of the head utility. The busybox implementation still eats extra characters, so the output is just 12345.
I guess this specific way of implementation is done for optimization purposes. You can't know where the line ends, so you don't know how many characters you need to read. The only way not to consume extra characters from the input stream is to read the stream byte by byte. But reading from the stream one byte at a time may be slow. So I guess head reads the input stream into a big enough buffer and then counts lines in that buffer.
The same can't be said for the case when the --bytes option is used. In this case you know how many bytes you need to read, so you may read exactly this number of bytes and no more. The coreutils implementation uses this opportunity, but the busybox one does not; it still reads more bytes than required into a buffer. It is probably done to simplify the implementation.
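The byte-by-byte approach for lines can be illustrated with the shell's read builtin, which has to stop at the newline and therefore does not take extra data from a pipe. This is only a minimal sketch of the idea, not a replacement for head; odd_lines is a hypothetical helper name, and a final line without a trailing newline is not handled.

```shell
# odd_lines is a hypothetical helper: print lines 1, 3, 5, ... of stdin.
# The read builtin stops at each newline, so nothing beyond the current
# line is taken from the pipe.
odd_lines() {
    while IFS= read -r odd; do
        printf '%s\n' "$odd"
        # Discard the following even-numbered line; stop at end of input.
        IFS= read -r _even || break
    done
}

printf 'aaa\nbbb\nccc\nddd\n' | odd_lines
```

Unlike the while true loops above, this one ends by itself when the input runs out, at the price of the slow one-byte-at-a-time reads.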
So the question: is it correct for the head utility to consume more characters from the input stream than it was asked to? Is there some kind of standard for Unix utilities? And if there is, does it specify this behavior?
PS: You have to press Ctrl+C to stop the commands above. The Unix utilities do not fail on reading beyond EOF. If you don't want to press it, you may use a more complex command:
echo 12345678901234567890 | (while true; do head -c 5; head -c 5 | [ `wc -c` -eq 0 ] && break >/dev/null; done)
which I didn't use for simplicity.
Best Answer
Yes, it’s allowed (see below).
Yes, POSIX volume 3, Shell & Utilities.
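It does, in its introduction: when a standard utility reads a seekable input file and terminates without an error before reaching end-of-file, it must leave the file offset positioned just past the last byte it processed. For files that are not seekable, such as pipes, the state of the file offset is unspecified.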
head is one of the standard utilities, so a POSIX-conforming implementation has to implement the behaviour described above.
GNU head does try to leave the file descriptor in the correct position, but it’s impossible to seek on pipes, so in your test it fails to restore the position. You can see this using strace: the read returns 17 bytes (all the available input), head processes four of those and then tries to move back 13 bytes, but it can’t. (You can also see here that GNU head uses an 8 KiB buffer.)
When you tell head to count bytes (which is non-standard), it knows how many bytes to read, so it can (if implemented that way) limit its read accordingly. This is why your head -c 5 test works: GNU head only reads five bytes and therefore doesn’t need to seek to restore the file descriptor’s position.
If you write the document to a file, and use that instead, you’ll get the behaviour you’re after:
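A minimal sketch of such a file-based test, assuming GNU head (or any implementation that repositions a seekable input as described above). The helper name odd_lines_from_file and the temporary file are illustrative, and the loop is bounded instead of using while true so that it stops without Ctrl+C. With GNU head this should print aaa and ccc.

```shell
# odd_lines_from_file is a hypothetical helper. It relies on head
# repositioning a seekable stdin after each call, as GNU head does.
odd_lines_from_file() {
    # $1: file to read; $2: number of line pairs to process
    i=0
    while [ "$i" -lt "$2" ]; do
        head -n 1               # print the odd-numbered line
        head -n 1 >/dev/null    # skip the even-numbered line
        i=$((i + 1))
    done < "$1"
}

tmp=$(mktemp)
printf 'aaa\nbbb\nccc\nddd\n' > "$tmp"
odd_lines_from_file "$tmp" 2
rm -f "$tmp"
```

The redirection sits on the loop as a whole, so every head call shares one open file and continues from the offset the previous call left behind.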