Why does the regex `"\.pdf"` match `/…/pdf…/…` in gawk, but not in mawk?

Tags: awk, gawk, mawk, regular expression

From "How can I extract only the pid column and only the pathname column in the lsof output?":

awk '{ for (i=9; i<=NF; i++) {
    if ($i ~ "string" && $1 != "wineserv" && $5 == "REG" && $NF ~ "\.pdf") {
        $1=$2=$3=$4=$5=$6=$7=$8=""
        print
    }
}}'

The regex "\.pdf" matches /.../pdf.../... in gawk, but not in mawk. I wonder why?

Thanks.

Best Answer

I don't think it's about the regex, but about how the double-quoted string is handled. C-style escapes (like \n) are interpreted in awk strings, and gawk and mawk treat invalid escapes differently:

$ mawk 'BEGIN { print "\."; }'
\.
$ gawk 'BEGIN { print "\."; }'
gawk: cmd. line:1: warning: escape sequence `\.' treated as plain `.'
. 

That is, mawk seems to leave the backslash as-is, while gawk removes it (and warns about it, at least in my version). So the regexes actually used are different. In gawk the regex is .pdf, which matches /pdf because an unescaped dot matches any single character. In mawk the regex is \.pdf, where the dot is escaped and matched literally.
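
To check which of the two behaviours a particular awk has, one can look at the length of the string rather than at its printed form. This is only a minimal sketch: it assumes the two outcomes described above (backslash kept or backslash stripped), the messages it prints are made up for illustration, and gawk will still emit its warning on stderr.

```shell
# "\." is 1 character long if the awk strips the backslash (as gawk does)
# and 2 characters long if it keeps it (as mawk does).
awk 'BEGIN {
    if (length("\.") == 1)
        print "backslash stripped"
    else
        print "backslash kept"
}'
```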

GNU awk's manual explicitly mentions it's not portable to use a backslash before a character with no defined backslash-escape sequence (see the box "Backslash Before Regular Characters"):

If you place a backslash in a string constant before something that is not one of the characters previously listed, POSIX awk purposely leaves what happens as undefined. There are two choices:

Strip the backslash out
    This is what BWK awk and gawk both do. For example, "a\qc" is the same as "aqc".
Leave the backslash alone
    Some other awk implementations do this. In such implementations, typing "a\qc" is the same as typing "a\\qc".

I assume you want the dot to be escaped in the regex, so the safe ways are either $NF ~ "\\.pdf", or $NF ~ /\.pdf/ (since with the regex literal /.../, the escapes aren't "double processed").

The POSIX text also notes the double processing of the escapes:

If the right-hand operand [of ~ or !~] is any expression other than the lexical token ERE, the string value of the expression shall be interpreted as an extended regular expression, including the escape conventions described above. Note that these same escape conventions shall also be applied in determining the value of a string literal (the lexical token STRING), and thus shall be applied a second time when a string literal is used in this context.
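
The two passes can be looked at separately. The sketch below first prints the string value of the literal "\\.pdf" together with its length, then uses the same literal as a dynamic regex. The input lines x.pdf and x/pdf are made-up samples.

```shell
# Pass 1: the string literal "\\.pdf" is reduced to the 5-character string \.pdf
awk 'BEGIN { s = "\\.pdf"; print s, length(s) }'
# prints: \.pdf 5

# Pass 2: that string is then read as the ERE \.pdf, i.e. a literal dot
printf '%s\n' x.pdf x/pdf |
  awk '{ print $0 ": " ($0 ~ "\\.pdf" ? "match" : "no match") }'
# prints: x.pdf: match
#         x/pdf: no match
```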

So, this works in both gawk and mawk:

$ ( echo .pdf; echo /pdf ) |
  awk '{ if ($0 ~ "\\.pdf") print "   match: " $0; else print "no match: " $0; }'
   match: .pdf
no match: /pdf

as does this:

$ ( echo .pdf; echo /pdf ) |
  awk '{ if ($0 ~ /\.pdf/) print "   match: " $0; else print "no match: " $0; }'
   match: .pdf
no match: /pdf
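
Another spelling that should sidestep the backslash question entirely is a bracket expression: in an ERE, [.] matches only a literal dot, and since it contains no backslash, the string-escape pass has nothing to change. A short sketch, with made-up sample paths and an added $ anchor (not in the original command) so that only names ending in .pdf match:

```shell
# [.] is a bracket expression containing just a dot, so no backslash is needed
printf '%s\n' /tmp/a.pdf /tmp/pdf/notes.txt |
  awk '$0 ~ "[.]pdf$" { print "match: " $0 }'
# prints: match: /tmp/a.pdf
```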