AWK POSIX – Split Individual Characters Using Null String

awk gawk mawk posix

I read this in the Gawk manual:

GNU EXTENSIONS

[…]

The ability to split out individual characters using the null string as the
value of FS, and as the third argument to split().

However this seems to not be the case. This works as expected:

$ gawk 'BEGIN {print split("quebec", z, "")}'
6

and I can disable other extensions:

$ export POSIXLY_CORRECT
$ gawk 'BEGIN {typeof(1)}'
gawk: cmd. line:1: fatal: function `typeof' not defined

but I cannot disable the split behavior:

$ export POSIXLY_CORRECT
$ gawk 'BEGIN {print split("quebec", z, "")}'
6

$ gawk --posix 'BEGIN {print split("quebec", z, "")}'
6

I also looked at the Mawk manual:

If FS = "", then mawk breaks the record into individual characters, and,
similarly, split(s,A,"") places the individual characters of s into A.

[…]

Posix explicitly leaves the behavior of FS = "" undefined, and mentions
splitting the record into characters as a possible interpretation, but
currently this use is not portable across implementations.

So, with what implementations can you not get single characters with FS and
split?

Best Answer

It's not POSIX in the sense that you can't use it in POSIX scripts: POSIX leaves the behaviour unspecified. That means that while an application (a script) can't use it if it wants to be portable, an implementation (an awk implementation) can do whatever it wants when a script does use it, and still be POSIX. POSIX does not require awk to split into characters or bytes, or report an error, or reboot the computer; it leaves it unspecified.

So gawk has no reason to change its behaviour in that regard when $POSIXLY_CORRECT is in the environment¹: no behaviour is more POSIXly correct than another in that instance.

As you found out, that extension is found in gawk (since 3.0, January 1996) and mawk (since version 1.2, January 1996). It's also in busybox awk (from the start, in 2002) and, since May 1996, in the one maintained by Brian Kernighan (the k in awk), whose FIXES file refers to gawk, etc. as inspiration. It looks like it was added to all three within a few months of each other, suggesting it may have been discussed among their maintainers. I'm not sure who got the idea first.

With Brian Kernighan's awk or the ones based on it like on FreeBSD or OpenBSD, note that while an empty FS or an empty third argument passed to split() causes the string to be split into its individual characters (well, bytes, see below), awk -F '' returns an error (awk -v FS= is OK though).

On Solaris, with both nawk and /usr/xpg4/bin/awk (and also the old /bin/awk from the 70s), an empty FS seems to disable splitting altogether. nawk -F '' returns an error. I'd expect it to be the same on other commercial Unices based on AT&T code like AIX or HP-UX, though I cannot test it there.
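
To see which of the behaviours described above a given system's awk has, a small probe along these lines may help. It is only a sketch: the probe_empty_fs name is made up for illustration, and it assumes that an NF of 3 for the input abc means per-character (or per-byte) splitting and an NF of 1 means no splitting, as reported above for Solaris. It uses -v FS= rather than -F '' because the latter is reported above to be rejected by some implementations.

```shell
# Hypothetical probe (not from the answer): report how the local awk
# treats an empty FS, based on NF for the three-letter input "abc".
probe_empty_fs() {
  # Errors are discarded so that a failing awk simply yields an empty result.
  nf=$(printf 'abc\n' | awk -v FS= '{print NF}' 2>/dev/null)
  case $nf in
    3) echo "splits into characters" ;;
    1) echo "no splitting" ;;
    *) echo "other: ${nf:-error}" ;;
  esac
}

probe_empty_fs
```

The input is plain ASCII on purpose, so the result does not depend on whether the implementation counts bytes or characters.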

Also note that mawk, bwk's awk (though some implementations based on it differ) and busybox awk don't support multibyte characters. So for instance, in UTF-8:

echo Stéphane | awk -v FS= '{print $4}'

would print the second half of the third character in my first name. So with those, it's more correct to say that an empty FS splits into individual bytes, not characters.
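
For a script that has to be portable, one way to sidestep the unspecified behaviour is to avoid the empty separator entirely and walk the string with length() and substr(), both of which POSIX does specify. The following is a minimal sketch, not something from the manuals quoted above; split_chars is a hypothetical helper name. With the byte-oriented implementations listed above, substr() can be expected to step through bytes as well, so the multibyte caveat still applies.

```shell
# Hypothetical helper: print each character of the first argument on its
# own line, without relying on an empty FS or split(s, a, "").
split_chars() {
  printf '%s\n' "$1" | awk '{
    n = length($0)
    # substr() is 1-indexed; take one character at a time.
    for (i = 1; i <= n; i++)
      print substr($0, i, 1)
  }'
}

split_chars quebec
```

Called as above, it prints the six letters of quebec one per line, which matches the count that split("quebec", z, "") returns in gawk.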


¹ I realise now that with POSIXLY_CORRECT, or --posix, gawk disables some extensions that otherwise don't conflict with POSIX (typeof does make gawk non-compliant though), so you could say it's an omission. It would not be the first, though. For instance, it does not disable nextfile even though it does conflict with POSIX (awk '{nextfile = 1}' is meant to assign 1 to the nextfile variable but reports an error in gawk even under POSIXLY_CORRECT).
