The BEGIN isn't superfluous. If you don't specify BEGIN, then the print would be executed for every line of input.
Quoting from the manual:
A BEGIN rule is executed once only, before the first input record is read. Likewise, an END rule is executed once only, after all the input is read.
$ seq 5 | awk 'BEGIN{print "Hello"}/4/{print}' # Hello printed once
Hello
4
$ seq 5 | awk '{print "Hello"}/4/{print}' # Hello printed for each line of input
Hello
Hello
Hello
Hello
4
Hello
$
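The manual quote also covers END, which the transcript above does not exercise. As a minimal sketch (the counter name n and the printed labels are made up for illustration), a BEGIN rule, a per-record rule and an END rule can be combined like this:

```shell
# BEGIN runs once before any input is read, END once after all of it;
# the middle rule has no pattern, so it runs for every input record.
out=$(seq 3 | awk 'BEGIN { print "start" } { n++ } END { print "lines:", n }')
printf '%s\n' "$out"
```

Here "start" is printed once, before the first line is read, and the count is printed once, after the last one.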
Here is an "intuitive" answer; for a more in-depth explanation of awk's mechanism, see @Cuonglm's answer.
In this case, !a[$0]++, the post-increment ++ can be set aside for a moment; it does not change the value of the expression. So, look at only !a[$0]. Here:
a[$0]
uses the current line $0 as key to the array a, taking the value stored there. If this particular key was never referenced before, a[$0] evaluates to the empty string.
!a[$0]
The ! negates the value from before. If it was empty or zero (false), we now have a true result. If it was non-zero (true), we have a false result. If the whole expression evaluated to true, meaning that a[$0] was not set to begin with, the whole line is printed as the default action.
Also, regardless of the old value, the post-increment operator adds one to a[$0], so the next time the same value in the array is accessed, it will be positive and the whole condition will fail.
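The behaviour described above can be sketched with a small input that contains repeated lines. The array name seen and the sample words are assumptions chosen for illustration; the expression is the same as !a[$0]++:

```shell
# The first time a line is seen, seen[$0] is empty, so !seen[$0] is true and
# the line is printed; the post-increment then makes later copies fail the test.
out=$(printf '%s\n' apple banana apple cherry banana | awk '!seen[$0]++')
printf '%s\n' "$out"
```

Only the first occurrence of each line is kept, in the original order.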
Best Answer
As Chris says, arguments of the form variablename=anything are treated as variable assignments instead of input file names. Those assignments are performed at the time the arguments are processed, as opposed to the (newer) -v var=value ones, which are performed before the BEGIN statements.
That can be useful in things like:
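A minimal sketch of such a use, assuming two hypothetical files colon.txt and comma.txt (names and contents are made up for illustration). The second command also shows the timing difference: a -v assignment is visible in BEGIN, while an argument assignment only takes effect once awk reaches it in the argument list:

```shell
tmp=$(mktemp -d)
printf 'alice:1\n' > "$tmp/colon.txt"
printf 'bob,2\n' > "$tmp/comma.txt"

# FS=: is applied just before colon.txt is opened, FS=, just before comma.txt.
fs_out=$(awk '{ print $1 }' FS=: "$tmp/colon.txt" FS=, "$tmp/comma.txt")
printf '%s\n' "$fs_out"

# early is set with -v (before BEGIN); late is set as an argument,
# so it is still empty in BEGIN and only set for the main rule.
timing_out=$(awk -v early=1 '
  BEGIN { print "BEGIN:", early, "[" late "]" }
        { print "main:", early, "[" late "]" }' late=2 "$tmp/colon.txt")
printf '%s\n' "$timing_out"

rm -rf "$tmp"
```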
Where you can specify a different FS/RS per file. It's also commonly used in:
Which is a safer version of:
(which doesn't work if file1 is empty)
But that gets in the way when you have files whose names contain = characters.
Now, that's only a problem when what's left of the first = is a valid awk variable name.
What constitutes a valid variable name in awk is stricter than in sh.
POSIX requires it to be something like:
With only characters of the portable character set. However, the /usr/xpg4/bin/awk of Solaris 11 at least is not compliant in that regard and allows any alphabetical characters in the locale in variable names, not just a-zA-Z.
So an argument like x+y=foo or =bar or ./foo=bar is still treated as an input file name and not an assignment, as what's left of the first = is not a valid variable name. An argument like Stéphane=Chazelas.txt may or may not be, depending on the awk implementation and locale.
That's why with awk, it's recommended to use ./*.txt instead of *.txt, for instance, to avoid the problem if you can't guarantee the names of the txt files won't contain = characters.
Also, beware that an argument like
-vfoo=bar.txt may be treated as an option if you use:
(also applies to awk '{code}' -vfoo=bar.txt with the awk from busybox versions prior to 1.28.0, see corresponding bug report).
Again, using ./*.txt works around that (using a ./ prefix also helps with a file called -, which awk otherwise understands as meaning standard input instead).
That's also why
shebangs don't really work. While the var=value ones can be worked around by fixing the ARGV values (add a ./ prefix) in a BEGIN statement:
That won't help with the option ones, as those are seen by awk and not the awk script.
One potential cosmetic issue with using that ./ prefix is that it ends up in FILENAME, but you can always use substr(FILENAME, 3) to strip it if you don't want it.
The GNU implementation of
awk fixes all those issues with its -E option.
After -E, gawk expects only the path of the awk script (where - still means stdin) and then a list of input file paths only (and there, not even - is treated specially).
It's specially designed for:
shebangs where the list of arguments is always input files (note that you're still free to edit that ARGV list in a BEGIN statement).
You can also use it as:
We use -E with an empty script (/dev/null) just to make sure those *.txt afterwards are always treated as input files, even if they contain = characters.
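The file name problem and the two workarounds discussed above can be sketched as follows. The file name a=b.txt and its contents are made up for illustration. The gawk part is an assumption on two counts: it only runs if gawk is installed, and passing the program with -e before -E /dev/null is one plausible way to combine the two options, not necessarily the exact command the answer had in mind:

```shell
tmp=$(mktemp -d)
printf 'data\n' > "$tmp/a=b.txt"

# Bare name: "a" is a valid variable name, so awk treats a=b.txt as an
# assignment, finds no file operand and reads (empty) standard input instead.
bare=$(cd "$tmp" && awk '{ print FILENAME ": " $0 }' a=b.txt < /dev/null)

# With a ./ prefix the argument can no longer be an assignment;
# substr(FILENAME, 3) strips the prefix again for display.
prefixed=$(cd "$tmp" && awk '{ print substr(FILENAME, 3) ": " $0 }' ./a=b.txt)

printf 'bare: [%s]\nprefixed: [%s]\n' "$bare" "$prefixed"

# gawk only: after -E, the remaining arguments are taken as input files.
if command -v gawk >/dev/null 2>&1; then
  (cd "$tmp" && gawk -e '{ print FILENAME ": " $0 }' -E /dev/null a=b.txt)
fi

rm -rf "$tmp"
```

The bare form silently produces nothing, which is what makes the problem easy to miss in scripts that loop over *.txt.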