AWK – Why AWK Stops if Filename Contains Equals Sign and Workaround

awk 'processing_script_here' my=file.txt

seems to stop and wait indefinitely.
What's going on here, and how do I make it work?

Best Answer

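As Chris says, arguments of the form variablename=anything are treated as variable assignments instead of input file names. Those assignments are performed at the time awk reaches the argument while processing its argument list, as opposed to the (newer) -v var=value ones, which are performed before the BEGIN statements run. With my=file.txt consumed as an assignment, no input file is left, so awk reads standard input and appears to hang.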
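To make the timing difference concrete, here is a small sketch. The file name input.txt and the variables a and b are made up for illustration.

```shell
# Hypothetical demo: file and variable names are illustrative only.
tmpdir=$(mktemp -d)
printf 'line\n' > "$tmpdir/input.txt"

# a is set with -v, so it is already visible in BEGIN.
# b is set by an operand, which awk only processes when it reaches it
# in the argument list, just before opening input.txt.
out=$(awk -v a=1 '
  BEGIN { print "BEGIN: a=" a " b=" b }
        { print "main: a=" a " b=" b }
' b=2 "$tmpdir/input.txt")
printf '%s\n' "$out"

rm -rf "$tmpdir"
```

The BEGIN line should show an empty b, while the main rule sees b=2.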

That can be useful in things like:

awk '{print $1}' FS=/ RS='\n' file1 FS='\n' RS= file2

Where you can specify a different FS/RS per file. It's also commonly used in:

awk '!file1_processed{a[$0]; next}; {...}' file1 file1_processed=1 file2

Which is a safer version of:

awk 'NR==FNR{a[$0]; next}; {...}' file1 file2

(which doesn't work if file1 is empty)
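A quick way to see the difference, assuming two scratch files where file1 is empty (the file contents and the "second file:" label are invented for the demo):

```shell
# Hypothetical demo files: file1 is empty, file2 has two lines.
tmpdir=$(mktemp -d)
: > "$tmpdir/file1"
printf 'x\ny\n' > "$tmpdir/file2"

# With an empty file1, NR never gets ahead of FNR, so NR==FNR stays true
# for every line of file2 and the second rule never runs.
idiom=$(awk 'NR==FNR{a[$0]; next}; {print "second file: " $0}' \
  "$tmpdir/file1" "$tmpdir/file2")

# The flag is assigned between the two files, regardless of their size.
flag=$(awk '!file1_processed{a[$0]; next}; {print "second file: " $0}' \
  "$tmpdir/file1" file1_processed=1 "$tmpdir/file2")

printf 'NR==FNR version printed: [%s]\n' "$idiom"
printf 'flag version printed:\n%s\n' "$flag"

rm -rf "$tmpdir"
```

The NR==FNR variant prints nothing here, whereas the flag variant processes file2 as intended.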

But that gets in the way when you have files whose name contains = characters.

Now, that's only a problem when the part to the left of the first = is a valid awk variable name.

What constitutes a valid variable name in awk is stricter than in sh.

POSIX requires it to be something like:

[_a-zA-Z][_a-zA-Z0-9]*

With only characters of the portable character set. However, the /usr/xpg4/bin/awk of Solaris 11 at least is not compliant in that regard and allows any alphabetical characters in the locale in variable names, not just a-zA-Z.

So an argument like x+y=foo or =bar or ./foo=bar is still treated as an input file name and not an assignment, as the part to the left of the first = is not a valid variable name. An argument like Stéphane=Chazelas.txt may or may not be treated as an assignment, depending on the awk implementation and locale.
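The following sketch counts how many records awk reads for a few such arguments. The file names are just examples, and stdin is redirected from /dev/null so that awk cannot hang when it ends up with no input file.

```shell
# Hypothetical demo: each file holds exactly one line.
olddir=$PWD
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf 'data\n' > 'my=file.txt'
printf 'data\n' > 'x+y=foo'

out=$(
  for arg in 'my=file.txt' 'x+y=foo' './my=file.txt'; do
    # NR in END is the number of records read for this invocation.
    n=$(awk 'END { print NR }' "$arg" < /dev/null)
    printf '%s: %s\n' "$arg" "$n"
  done
)
printf '%s\n' "$out"

cd "$olddir" || exit 1
rm -rf "$tmpdir"
```

Only my=file.txt should report 0 records: it is taken as an assignment to the variable my, so awk falls back to the (empty) standard input.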

That's why with awk, it's recommended to use:

awk '...' ./*.txt

instead of

awk '...' *.txt

for instance to avoid the problem if you can't guarantee the name of the txt files won't contain = characters.
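With a glob the failure is quieter than a hang: as long as at least one other file matches, the offending file is simply skipped. A sketch, assuming a scratch directory with two made-up files:

```shell
# Hypothetical demo: a=b.txt looks like an assignment, c.txt does not.
olddir=$PWD
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf 'one\n' > 'a=b.txt'
printf 'two\n' > 'c.txt'

# a=b.txt is silently consumed as "a = b.txt"; only c.txt is read.
plain=$(awk '{ print FILENAME }' *.txt < /dev/null)

# With the ./ prefix both arguments are opened as files.
prefixed=$(awk '{ print FILENAME }' ./*.txt < /dev/null)

printf 'without prefix:\n%s\n' "$plain"
printf 'with prefix:\n%s\n' "$prefixed"

cd "$olddir" || exit 1
rm -rf "$tmpdir"
```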

Also, beware that an argument like -vfoo=bar.txt may be treated as an option if you use:

awk -f file.awk -vfoo=bar.txt

(also applies to awk '{code}' -vfoo=bar.txt with the awk from busybox versions prior to 1.28.0, see corresponding bug report).

Again, using ./*.txt works around that (using a ./ prefix also helps with a file called - which otherwise awk understands as meaning standard input instead).

That's also why

#! /usr/bin/awk -f

shebangs don't really work. The var=value arguments can be worked around by fixing the ARGV values (adding a ./ prefix) in a BEGIN statement:

#! /usr/bin/awk -f
BEGIN {
  for (i = 1; i < ARGC; i++)
    if (ARGV[i] ~ /^[_[:alpha:]][_[:alnum:]]*=/)
      ARGV[i] = "./" ARGV[i]
}
# rest of awk script

That won't help with arguments that look like options, though, as those are interpreted by awk itself and never reach the awk script.

One potential cosmetic issue with using that ./ prefix is it ends up in FILENAME, but you can always use substr(FILENAME, 3) to strip it if you don't want it.
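Putting the two pieces together, here is a sketch of the ARGV fix plus the prefix stripping. The script name fixargs.awk and the variable name are invented for the example. I use the [_a-zA-Z] form of the pattern here, as some older awk builds lack POSIX character classes, and I only strip when the name actually starts with ./ (note that this also strips a ./ the user typed themselves).

```shell
# Hypothetical demo: fixargs.awk and my=file.txt are illustrative names.
olddir=$PWD
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf 'hello\n' > 'my=file.txt'

cat > fixargs.awk <<'EOF'
BEGIN {
  # Prefix anything that looks like var=value so awk opens it as a file.
  for (i = 1; i < ARGC; i++)
    if (ARGV[i] ~ /^[_a-zA-Z][_a-zA-Z0-9]*=/)
      ARGV[i] = "./" ARGV[i]
}
{
  name = FILENAME
  # Drop the cosmetic "./" prefix before reporting the name.
  if (substr(name, 1, 2) == "./")
    name = substr(name, 3)
  print name ": " $0
}
EOF

out=$(awk -f fixargs.awk 'my=file.txt' < /dev/null)
printf '%s\n' "$out"

cd "$olddir" || exit 1
rm -rf "$tmpdir"
```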

The GNU implementation of awk fixes all those issues with its -E option.

After -E, gawk expects only the path of the awk script (where - still means stdin) and then a list of input file paths only (and there, not even - is treated specially).

It's specially designed for:

#! /usr/bin/gawk -E

shebangs where the arguments are always input files (note that you're still free to edit that ARGV list in a BEGIN statement).

You can also use it as:

gawk -e '...awk code here...' -E /dev/null *.txt

We use -E with an empty script (/dev/null) just to make sure those *.txt afterwards are always treated as input files, even if they contain = characters.
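A sketch of that last form, guarded so that it only runs where gawk is installed. The file name a=b.txt is made up for the demo.

```shell
# Hypothetical demo: requires GNU awk; skipped when gawk is not available.
if command -v gawk > /dev/null 2>&1; then
  olddir=$PWD
  tmpdir=$(mktemp -d)
  cd "$tmpdir" || exit 1
  printf 'hello\n' > 'a=b.txt'

  # The program comes from -e; -E /dev/null adds an empty script and ends
  # option processing, so everything after it is taken as an input file.
  out=$(gawk -e '{ print FILENAME ": " $0 }' -E /dev/null *.txt < /dev/null)
  printf '%s\n' "$out"

  cd "$olddir" || exit 1
  rm -rf "$tmpdir"
else
  echo 'gawk not found; skipping demo' >&2
fi
```

No ./ prefix is needed here, so FILENAME comes out as a=b.txt unchanged.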
