To quote the POSIX spec for awk:
When an ERE token appears as an expression in any context other than as the right-hand of the ~ or !~ operator or as one of the built-in function arguments described below, the value of the resulting expression shall be the equivalent of:
$0 ~ /ere/
This (combined with the action defaulting to { print }) is why you can use awk as a grep substitute by just doing awk '/b/' <file.
So, the answer is just "it's defined to work that way". /ere/ is defined to mean $0 ~ /ere/ except in certain circumstances, and /ere/ ~ $1 is not one of the exceptional circumstances, so it gets evaluated as ($0 ~ /ere/) ~ $1.
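A small sketch of that rule in action. The sample lines and the lines helper are made up for illustration, and some awk implementations may print a warning on stderr about a regex on the left of ~:

```shell
# Hypothetical sample data, not from the original question.
lines() { printf '%s\n' '1 b' 'b x' '0 q'; }

# A bare /b/ pattern is shorthand for $0 ~ /b/, with { print } as the default action.
lines | awk '/b/'
lines | awk '$0 ~ /b/ { print }'

# /b/ ~ $1 is ($0 ~ /b/) ~ $1: the 1-or-0 match result is what gets
# matched against $1 used as a regex, not the text "b".
lines | awk '{ print $0 ": " (/b/ ~ $1) }'
```

The last command should report 1 for "1 b" and for "0 q", and 0 for "b x". The "0 q" line contains no b at all, yet still yields 1, because the result 0 is matched against the regex 0 taken from $1.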
I don't think it's about the regex, but about how the double-quoted string is handled. C-style escapes (like \n) are interpreted in awk strings, and gawk and mawk treat invalid escapes differently:
$ mawk 'BEGIN { print "\."; }'
\.
$ gawk 'BEGIN { print "\."; }'
gawk: cmd. line:1: warning: escape sequence `\.' treated as plain `.'
.
That is, mawk seems to leave the backslash as-is, while gawk removes it (and complains, at least in my version). So, the actual regexes used are different: in gawk the regex is .pdf, which of course matches /pdf, since the dot matches any single character, while in mawk your regex is \.pdf, where the dot is escaped and matched literally.
GNU awk's manual explicitly mentions it's not portable to use a backslash before a character with no defined backslash-escape sequence (see the box "Backslash Before Regular Characters"):
If you place a backslash in a string constant before something that is not one of the characters previously listed, POSIX awk purposely leaves what happens as undefined. There are two choices:
Strip the backslash out
This is what BWK awk and gawk both do. For example, "a\qc" is the same as "aqc".
Leave the backslash alone
Some other awk implementations do this. In such implementations, typing "a\qc" is the same as typing "a\\qc".
I assume you want the dot to be escaped in the regex, so the safe ways are either $NF ~ "\\.pdf", or $NF ~ /\.pdf/ (since with the regex literal /.../, the escapes aren't "double processed").
The POSIX text also notes the double processing of the escapes:
If the right-hand operand [of ~ or !~] is any expression other than the lexical token ERE, the string value of the expression shall be interpreted as an extended regular expression, including the escape conventions described above. Note that these same escape conventions shall also be applied in determining the value of a string literal (the lexical token STRING), and thus shall be applied a second time when a string literal is used in this context.
So, this works in both gawk and mawk:
$ ( echo .pdf; echo /pdf ) |
awk '{ if ($0 ~ "\\.pdf") print " match: " $0; else print "no match: " $0; }'
match: .pdf
no match: /pdf
as does this:
$ ( echo .pdf; echo /pdf ) |
awk '{ if ($0 ~ /\.pdf/) print " match: " $0; else print "no match: " $0; }'
match: .pdf
no match: /pdf
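The same two spellings applied to $NF, as a rough sketch. The sample helper and its input lines are invented here, and the bracket-expression variant is an extra option that the answer above does not mention; it avoids backslashes altogether, so the escape handling never comes into play:

```shell
# Hypothetical sample input: the last field is either a real .pdf name
# or a path that merely ends in /pdf.
sample() { printf '%s\n' 'x report.pdf' 'x dir/pdf'; }

# Regex literal: the backslash is processed once, by the regex parser.
sample | awk '$NF ~ /\.pdf/ { print "literal: " $NF }'

# String used as a regex: processed twice, so the backslash is doubled.
sample | awk '$NF ~ "\\.pdf" { print "string:  " $NF }'

# Bracket expression: [.] matches a literal dot and needs no backslash.
sample | awk '$NF ~ "[.]pdf" { print "bracket: " $NF }'
```

All three commands should print only report.pdf and skip dir/pdf.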
Best Answer
The 0+ needs to be prefixed to each $1 to force a numeric conversion. max does not need 0+: the value is already converted to a number before it is stored.
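A minimal sketch of the difference, assuming the script in question tracks a running maximum of the first field. The weights helper and the "kg" suffix are made up here to get fields that do not look like plain numbers, which is one case where awk falls back to string comparison:

```shell
# Hypothetical input: numbers with a unit suffix, so awk treats the
# fields as strings rather than as numeric strings.
weights() { printf '%s\n' 9kg 10kg 100kg 25kg; }

# Without 0+: string comparison, where "10kg" sorts before "9kg".
weights | awk '$1 > max { max = $1 } END { print max }'

# With 0+ on each $1: numeric comparison, and max is stored as a number.
weights | awk '0+$1 > max { max = 0+$1 } END { print max }'
```

The first command should print 9kg and the second 100.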