Awk – Is Awk’s Nextfile Specified in POSIX?

awk posix

GNU awk manual on nextfile reads:

NOTE: For many years, nextfile was a common extension. In September 2012, it was accepted for inclusion into the POSIX standard. See the Austin Group website.

Likewise, mawk manual says:

Nextfile is a gawk extension (also implemented by BWK awk), is not yet
part of the POSIX standard (as of October 2012), although it has been
accepted for the next revision of the standard.

What confuses me is that there is no mention of nextfile in the latest POSIX specification, from 2018.

Following the link to the Austin Group, you find that the issue was resolved in 2012 (with even a final accepted text), but only applied in 2020(!).

All in all, does it mean nextfile is an awk feature specified by POSIX? Or will it only be so in a future POSIX version?

(For practical purposes, nextfile is also to be found in BSD awk.)

Two more statements are in the same situation as nextfile: fflush and delete (delete is already specified, but is to be expanded so as to be able to delete an entire array).

Best Answer

You'll see that bug 607 is targeted for Issue 8, which has not been released yet (see its issue8 tag).

Issue 7 was released in 2008. There have been a few newer editions of Issue 7 since, the latest in 2018, but those are technical corrigenda and do not bring new features.

nextfile is not only a new feature, it also breaks backward compatibility: awk '{nextfile = 1}' and awk '{nextfile}' are valid awk invocations which, in the current POSIX version, respectively set and retrieve the value of a variable named nextfile. So it could not possibly be added as part of a technical corrigendum.

What could have been added in a TC (and probably should have been) is a note that nextfile is a word reserved for future use, so that people avoid it in their variable or function names. A script that does awk '{nextfile = 1}', though perfectly standard, does not work in many awk implementations (and that's not limited to nextfile, by the way).
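A quick way to see which reading a given awk applies is to feed it the assignment and check whether it parses. This is only a rough sketch: probe_nextfile is a made-up helper name, and it assumes that an awk which reserves the word rejects the program with a non-zero exit status.

```shell
# probe_nextfile: hypothetical helper, not a standard tool.
# Reports how the awk found on $PATH parses the word "nextfile".
probe_nextfile() {
    # Under the POSIX Issue 7 grammar this is a plain variable assignment
    # and awk exits 0. An awk that treats nextfile as a statement fails
    # to parse the program and exits non-zero.
    if awk '{ nextfile = 1 }' </dev/null 2>/dev/null; then
        echo variable
    else
        echo keyword
    fi
}

probe_nextfile
```

With gawk, mawk or BSD awk this is expected to report keyword; an implementation that follows the Issue 7 text to the letter would report variable.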

You can check an HTML rendition of the awk part of the 2018 edition of Issue 7 of the Single UNIX Specification at https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/utilities/awk.html (note the .2018edition part). Note, however, that even though it is published by the Open Group, the HTML version has no normative status; only the PDF does (you need to register with them to get access to it).

They're meant to be equivalent, though there have been several bugs in the conversion to HTML in the past which have caused sections to be missing (though they're generally fixed quickly when spotted), so when in doubt, best is to check the PDF.

Related Solutions

Are “mostly POSIX-compliant” systems still considered POSIX systems

No. The Open Group has an actual POSIX certification process, so if an operating system hasn't been through that, it cannot be referred to as POSIX-compliant.

Is this tail behavior in Grouping Commands specified by POSIX

Note that the problem is not with tail but with head here which reads from the pipe more than the first line it is meant to output (so there's nothing left for tail to read).

And yes, it's POSIX conformant.

head is required to leave the cursor within stdin just after the last line it has output when the input is seekable, but not otherwise.

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap01.html:

When a standard utility reads a seekable input file and terminates without an error before it reaches end-of-file, the utility shall ensure that the file offset in the open file description is properly positioned just past the last byte processed by the utility. For files that are not seekable, the state of the file offset in the open file description for that file is unspecified.

For head to be able to do that for a non-seekable file, it would have to read one byte at a time, which would be terribly inefficient¹. That's what the read or line utilities do, or GNU sed with the -u option.

So you can replace head -n 20 with gsed -u 20q if you want that behaviour.
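The same idea can be sketched with the shell's read builtin, which in practice consumes a pipe one byte at a time and so stops right after the first line. The function name first_and_last is made up for the example, and the behaviour shown assumes a shell whose read does not read ahead on a pipe.

```shell
# first_and_last: hypothetical helper name.
# read takes exactly one line from stdin, leaving the rest of the
# pipe for tail, which then prints the last line.
first_and_last() {
    { IFS= read -r first && printf '%s\n' "$first"; tail -n 1; }
}

printf '%s\n' one two three four | first_and_last
```

On the four-line input above this prints one followed by four, whereas head -n 1 in place of read may leave nothing in the pipe for tail.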

Though here, you'd rather want:

sed -e 1b -e '$b' -e d

instead. Here, only one tool invocation, so no problem with an internal buffer that can't be shared between two tool invocations. Note however that for large files, it's going to be less efficient as sed reads the whole file, while for seekable files tail would skip most of it by seeking near the end of the file.
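As a small check of that sed command on piped input (first_last_sed is just an illustrative wrapper name):

```shell
# first_last_sed: hypothetical wrapper around the sed command above.
#   1b  -> on line 1, branch to the end of the script (line is auto-printed)
#   $b  -> on the last line, do the same
#   d   -> delete every other line
first_last_sed() {
    sed -e 1b -e '$b' -e d
}

printf '%s\n' one two three four | first_last_sed
```

This prints one and four. With a single-line input the line is printed once, since the first matching branch already ends the script for that line.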

See the related discussion about buffering at Why is using a shell loop to process text considered bad practice?.

Note that tail must output the tail of the stream on stdin. While, as an optimisation and for seekable files, implementations may seek to the end of the file to get the trailing data from there, it is not allowed to seek back to a point that would be before the initial position at the time tail was invoked (Busybox tail used to have that bug).

So for instance in:

{ cat; tail -n 1; } < file

Even though tail could seek back to the last line of file, it does not. Its stdin is an empty stream as cat left the cursor at the end of the file; it's not allowed to reclaim data from that stream by seeking further backward in the file.

(Text above crossed out pending clarification by the Open Group and considering that it's not done correctly by several implementations)

¹ The head builtin of ksh93 (enabled if you put /opt/ast/bin ahead of $PATH), for sockets (a type of non-seekable file), instead peeks at the input (using recvfrom(..., MSG_PEEK)) prior to actually reading it, to see how much it needs to read and make sure it doesn't read too much. It falls back to reading one byte at a time for other types of files. That is slightly more efficient, and I believe it is the main reason why ksh93 implements its pipes with socketpair()s instead of pipe(). Note that it's not completely foolproof, as there's a race condition that could be triggered if another process read from the socket in between the peek and the read.
