Using tail combined with other standard tools in Grouping Commands can make some powerful constructions. For example, to get the first and last line of a file:
$ seq 10 > file
$ { head -n1; tail -n1; } <file
1
10
When the file contents are fed to the command group through a pipe, tail fails to produce output, because a pipe is not seekable:
$ seq 10 | { head -n1; tail -n1; }
1
Now, when the content is big enough, tail works:
$ seq 10000 | { head -n1; tail -n1; }
1
10000
That's because, after the first lseek failure, tail knows it's not a seekable file descriptor and, because the contents of the pipe haven't all been read yet, it starts reading the content until the end.
From a user's point of view, I expect the behavior to be consistent regardless of the size of the input. I've looked through the POSIX tail and lseek documentation and didn't find any description of this.
Is this behavior specified by POSIX? If not, how can I make the result always consistent?
I have tested with GNU tail and FreeBSD tail; both have the same behavior.
Best Answer
Note that the problem here is not with tail but with head, which reads more from the pipe than the first line it is meant to output (so there's nothing left for tail to read).
And yes, it's POSIX conformant. head is required to leave the cursor within stdin just after the last line it has output when the input is seekable, but not otherwise (see http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap01.html).
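The difference between seekable and non-seekable input described above can be observed directly. This is only a sketch: the helper name rest_after_head is made up for illustration, and the pipe case depends on how much head happens to read in one go.

```shell
# Hypothetical helper (name invented for this sketch): discard the line
# that head takes, then print whatever is still left on the same stdin.
rest_after_head() {
    { head -n 1 >/dev/null; cat; }
}

tmp=$(mktemp)
seq 3 > "$tmp"

# Seekable input: head has to leave the offset just after line 1,
# so cat still sees lines 2 and 3.
rest_after_head < "$tmp"

# Pipe: head may already have pulled later lines into its own buffer,
# so for an input this short cat typically finds nothing left.
seq 3 | rest_after_head

rm -f "$tmp"
```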
For head to be able to do that for a non-seekable file, it would have to read one byte at a time, which would be terribly inefficient¹. That's what the read or line utilities do, or GNU sed with the -u option.
So you can replace head -n 20 with gsed -u 20q if you want that behaviour.
Though here, you'd rather use a single sed invocation that prints only the first and last lines instead. With only one tool invocation, there is no problem with an internal buffer that can't be shared between two tool invocations. Note however that for large files, it's going to be less efficient, as sed reads the whole file, while for seekable files tail would skip most of it by seeking near the end of the file.
See the related discussion about buffering at Why is using a shell loop to process text considered bad practice?.
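The two workarounds mentioned above, taking the first line without over-reading and doing everything in one sed process, might be sketched as follows. The function names are made up for illustration, and the sed expression is just one possible way to print the first and last lines, not necessarily the one the answer had in mind.

```shell
# Hypothetical helpers; both names are invented for this sketch.

# 1. The shell's read stops consuming stdin at the first newline, so the
#    rest of a pipe is still there for tail. Assumes at least two lines
#    of input: with a single line, tail has nothing left to print.
first_last_read() {
    IFS= read -r first && printf '%s\n' "$first"
    tail -n 1
}

# 2. A single sed process: line 1 branches to the end of the script and
#    is printed, every other line except the last one is deleted.
first_last_sed() {
    sed -e 1b -e '$!d'
}

seq 10 | first_last_read
seq 10000 | first_last_sed
```

Both forms give the same result whatever the input size, at the cost of read being slow per line and sed reading the whole input.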
Note that tail must output the tail of the stream on stdin. While, as an optimisation and for seekable files, implementations may seek to the end of the file to get the trailing data from there, it is not allowed to seek back to a point that would be before the initial position at the time tail was invoked (Busybox tail used to have that bug).
So, for instance, when tail runs after cat has already read all of file from the same stdin, even though tail could seek back to the last line of file, it does not. Its stdin is an empty stream, as cat left the cursor at the end of the file; it's not allowed to reclaim data from that stream by seeking further backward in the file.
(Text above crossed out pending clarification by the Open Group and considering that it's not done correctly by several implementations)
¹ The head builtin of ksh93 (enabled if you put /opt/ast/bin ahead of $PATH), for sockets (a type of non-seekable file), instead peeks at the input (using recvfrom(..., MSG_PEEK)) prior to actually reading it, to see how much it needs to read and make sure it doesn't read too much. It falls back to reading one byte at a time for other types of files. That is slightly more efficient and, I believe, is the main reason why ksh93 implements its pipes with socketpair()s instead of pipe(). Note that it's not completely foolproof, as there's a race condition that could be triggered if another process read from the socket in between the peek and the read.