I read this in the Gawk manual:
GNU EXTENSIONS
[…]
The ability to split out individual characters using the null string as the
value of FS, and as the third argument to split().
However this seems to not be the case. This works as expected:
$ gawk 'BEGIN {print split("quebec", z, "")}'
6
and I can disable other extensions:
$ export POSIXLY_CORRECT
$ gawk 'BEGIN {typeof(1)}'
gawk: cmd. line:1: fatal: function `typeof' not defined
but I cannot disable the split behavior:
$ export POSIXLY_CORRECT
$ gawk 'BEGIN {print split("quebec", z, "")}'
6
$ gawk --posix 'BEGIN {print split("quebec", z, "")}'
6
I also looked at the Mawk manual:
If FS = "", then mawk breaks the record into individual characters, and,
similarly, split(s,A,"") places the individual characters of s into A.[…]
Posix explicitly leaves the behavior of FS = "" undefined, and mentions
splitting the record into characters as a possible interpretation, but
currently this use is not portable across implementations.
So, with what implementations can you not get single characters with FS and split?
Best Answer
That's not POSIX in that you can't use it in POSIX scripts, because POSIX leaves the behaviour unspecified. That means that while an application (a script) can't use it if it wants to be portable, an implementation (an awk implementation) can do whatever it wants if you do use it, and still be POSIX. POSIX does not require awk to split into characters or bytes, or report an error, or reboot the computer; it leaves it unspecified.

So gawk has no reason to change its behaviour in that regard when $POSIXLY_CORRECT is in the environment¹: there is no behaviour that is more POSIXly correct than the other in that instance.

As you found out, that extension is found in gawk (since 3.0, January 1996) and mawk (since version 1.2, January 1996). It's also in busybox
awk (from the start, in 2002) and, since May 1996, also in the one maintained by Brian Kernighan (the k in awk); its FIXES file refers to gawk, etc. as inspiration. It looks like it was added to all 3 within a few months, suggesting maybe it was discussed among their maintainers. I'm not so sure now who got the idea first.

With Brian Kernighan's
awk, or the ones based on it as on FreeBSD or OpenBSD, note that while an empty FS or an empty third argument passed to split() causes the string to be split into its individual characters (well, bytes, see below), awk -F '' returns an error (awk -v FS= is OK though).

On Solaris, with both
nawk and /usr/xpg4/bin/awk (and also the old /bin/awk from the 70s), an empty FS seems to disable splitting altogether. nawk -F '' returns an error. I'd expect it would be the same on other commercial Unices based on AT&T code like AIX or HP/UX, though I cannot test it there.

Also note that
mawk, bwk's awk (that's different for some based on it) and busybox awk don't support multibyte characters. So for instance, in UTF-8:

[…]

would print the second half of the third character in my first name. So with those, it's more correct to say that an empty FS splits into individual bytes, not characters.
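If you want to check a particular awk yourself, or avoid the empty separator altogether, something along these lines may help. It is only a sketch: the function names probe_empty_split and split_chars are made up for illustration, and the values mentioned in the comments are what I would expect from the behaviours described above, not something verified on every implementation.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Report how a given awk treats an empty third argument to split().
# The test string is "a", then U+00E9 written as its two UTF-8 bytes
# in octal escapes, then "b". Going by the behaviours described above,
# one would expect 3 for a split into characters, 4 for a split into
# bytes, and 1 when no splitting happens at all.
probe_empty_split() {
  "${1:-awk}" 'BEGIN { print split("a\303\251b", parts, "") }'
}

# Split a string without relying on an empty separator: walk it with
# substr(). Prints the element count, then one element per line.
# Caveats: -v processes backslash escapes in the value, and whether the
# elements are characters or bytes still depends on the awk in use.
split_chars() {
  awk -v s="$1" 'BEGIN {
    n = length(s)
    for (i = 1; i <= n; i++) {
      chars[i] = substr(s, i, 1)
    }
    print n
    for (i = 1; i <= n; i++) {
      print chars[i]
    }
  }'
}

probe_empty_split awk
split_chars quebec
```

With the ASCII string quebec, split_chars prints 6 followed by the six letters, one per line. Pass another command name to probe_empty_split (for example gawk or mawk, if installed) to compare implementations.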
¹ I realise now that with POSIXLY_CORRECT, or --posix, gawk disables some extensions that otherwise don't conflict with POSIX (typeof does make gawk non-compliant though), so you could say it's an omission. It would not be the first, though. For instance, it does not disable nextfile even though it does conflict with POSIX (awk '{nextfile = 1}' is meant to assign 1 to the nextfile variable but reports an error in gawk even under POSIXLY_CORRECT).