Why does regex `"\.pdf"` match `/…/pdf…/…` in gawk, but not in mawk?

awk, gawk, mawk, regular expression

From How can I extract only the pid column and only the pathname column in the lsof output?

awk '{ for (i=9; i<=NF; i++) {
    if ($i ~ "string" && $1 != "wineserv" && $5 == "REG" && $NF ~ "\.pdf") {
        $1=$2=$3=$4=$5=$6=$7=$8=""
        print
    }
}}'

The regex "\.pdf" matches /.../pdf.../... in gawk, but not in mawk. I wonder why?

Thanks.

Best Answer

I don't think it's about the regex, but about how the double-quoted string is handled. C-style escapes (like \n) are interpreted in awk strings, and gawk and mawk treat invalid escapes differently:

$ mawk 'BEGIN { print "\."; }'
\.
$ gawk 'BEGIN { print "\."; }'
gawk: cmd. line:1: warning: escape sequence `\.' treated as plain `.'
.

That is, mawk seems to leave the backslash as-is, while gawk removes it (and complains, at least in my version). So, the actual regexes used are different: in gawk the regex is .pdf, which of course matches /pdf, since the dot matches any single character, while in mawk your regex is \.pdf, where the dot is escaped and matched literally.

GNU awk's manual explicitly mentions it's not portable to use a backslash before a character with no defined backslash-escape sequence (see the box "Backslash Before Regular Characters"):

If you place a backslash in a string constant before something that is not one of the characters previously listed, POSIX awk purposely leaves what happens as undefined. There are two choices:

Strip the backslash out
This is what BWK awk and gawk both do. For example, "a\qc" is the same as "aqc".
Leave the backslash alone
Some other awk implementations do this. In such implementations, typing "a\qc" is the same as typing "a\\qc".

I assume you want the dot to be escaped in the regex, so the safe ways are either $NF ~ "\\.pdf", or $NF ~ /\.pdf/ (since with the regex literal /.../, the escapes aren't "double processed").

The POSIX text also notes the double processing of the escapes:

If the right-hand operand [of ~ or !~] is any expression other than the lexical token ERE, the string value of the expression shall be interpreted as an extended regular expression, including the escape conventions described above. Note that these same escape conventions shall also be applied in determining the value of a string literal (the lexical token STRING), and thus shall be applied a second time when a string literal is used in this context.

So, this works in both gawk and mawk:

$ ( echo .pdf; echo /pdf ) |
  awk '{ if ($0 ~ "\\.pdf") print "   match: " $0; else print "no match: " $0; }'
   match: .pdf
no match: /pdf

as does this:

$ ( echo .pdf; echo /pdf ) |
  awk '{ if ($0 ~ /\.pdf/) print "   match: " $0; else print "no match: " $0; }'
   match: .pdf
no match: /pdf
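A third option avoids backslashes altogether: inside a bracket expression the dot is literal, so "[.]pdf" means the same thing whichever way the awk at hand treats "\.". Here is a minimal sketch; the helper name match_pdf and the sample file names are made up for illustration, and the trailing $ is my addition to anchor the match at the end of the line.

```shell
# match_pdf: hypothetical helper; prints the input lines that end in a literal ".pdf".
# "[.]" is a bracket expression, so no backslash escape is needed in the string.
match_pdf() {
    awk '$0 ~ "[.]pdf$"'
}

printf '%s\n' report.pdf /tmp/pdf notes.txt | match_pdf
```

This should print only report.pdf, since /tmp/pdf has a slash rather than a dot before "pdf".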

Related Solutions

How to find all matches to a regular expression in a string

In POSIX awk,
Is there a builtin function which can achieve either of the two objectives?

No. You can achieve the same effect, but not with a single builtin function.

Does the match builtin function only find the leftmost and longest match?

Yes. Regular expressions in POSIX awk (and GNU awk) are always greedy (i.e. longest match always wins).

To achieve the first objective, is it a correct way to repeatedly apply match to the suffix of the target string created by finding each match and removing the match and the prefix before it from the target string?

Yes, but if you want 100% compatibility with gsub() the details are pretty tricky.

Is https://gist.github.com/mllamazing/a40946fcf8211a503bed a correct implementation?

Mostly, if you remove the gsub line. The devil is in the details: the code will loop forever if regex is an empty string. Classic awk didn't allow empty regexps, but IIRC nawk did. To fix that you could do something like this:

function FindAllMatches(str, regex, match_arr) {

    ftotal = 0;
    ini = RSTART;
    leng = RLENGTH;

    delete match_arr;

    while (str != "" && match(str, regex) > 0) {
        match_arr[++ftotal] = substr(str, RSTART, RLENGTH)
        str = substr(str, RSTART + (RLENGTH ? RLENGTH : 1))
    }

    RSTART = ini;
    RLENGTH = leng;
}

That's not 100% compatible with gsub() however, because

$ echo 123 | awk '{ gsub("", "-") } 1'
-1-2-3-

while the function above finds only 3 matches (namely, it misses the match at the end).

You could try this instead:

function FindAllMatches(str, regex, match_arr) {

    ftotal = 0;
    ini = RSTART;
    leng = RLENGTH;

    delete match_arr;

    while (match(str, regex) > 0) {
        match_arr[++ftotal] = substr(str, RSTART, RLENGTH)
        if (str == "") break
        str = substr(str, RSTART + (RLENGTH ? RLENGTH : 1))
    }

    RSTART = ini;
    RLENGTH = leng;
}

This fixes the problem above, but it breaks other cases: if str = "123" and regex = "[1-9]*" the function finds two occurrences, 123 and the empty string at the end, while gsub() finds only one, 123.

There may be other similar differences that I can't be bothered to hunt down right now.
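For what it's worth, here is a rough sketch of how the first version could also record where each match starts, assuming match positions are what the second objective is after. The names find_all, FindAllAt, at, off and step are mine, not from the gist. The extra parameters after the gap are awk's usual way of getting local variables, so nothing leaks into globals.

```shell
# find_all: hypothetical wrapper; $1 is an ERE, input lines come from stdin.
find_all() {
    awk -v re="$1" '
    # Parameters after the wide gap are locals.
    function FindAllAt(str, regex, m, at,    n, off, step, rs, rl) {
        rs = RSTART; rl = RLENGTH          # save the state match() overwrites
        delete m; delete at
        n = 0; off = 0                     # off = characters already consumed
        while (str != "" && match(str, regex) > 0) {
            m[++n] = substr(str, RSTART, RLENGTH)
            at[n] = off + RSTART           # position in the original string
            step = RSTART - 1 + (RLENGTH ? RLENGTH : 1)
            off += step
            str = substr(str, step + 1)    # continue after this match
        }
        RSTART = rs; RLENGTH = rl          # restore
        return n                           # number of matches found
    }
    {
        n = FindAllAt($0, re, m, at)
        for (i = 1; i <= n; i++) printf "%s at %d\n", m[i], at[i]
    }'
}

echo 'a12b345c6' | find_all '[0-9]+'
```

With that input it should print "12 at 2", "345 at 5" and "6 at 9". The same caveats as above apply to empty matches. On top of that, an anchored regex such as ^a is retried against each suffix, so it can report matches that gsub() would not. Also note that awk processes backslash escapes in -v assignments, so a pattern containing backslashes would need them doubled.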

In Gawk,

does array after a call patsplit(string, array, fieldpat, seps) store the matches as required in the second objective?

Mostly yes. However, corner cases related to regexps can be unexpectedly subtle. There may be some differences, as above.

Can the locations of the matches be found from array and seps, given that seps[i] is the separator string between array[i] and array[i+1]?

Yes.
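As a sketch of how that works (gawk only, since patsplit is a gawk extension): seps[0] holds whatever precedes the first match, so the first match starts at length(seps[0]) + 1, and each later one starts after the previous match plus the separator that follows it. The helper name match_positions is made up for illustration.

```shell
# match_positions: hypothetical helper; $1 is an ERE, input lines come from stdin.
# Requires gawk, because patsplit() is a gawk extension.
match_positions() {
    gawk -v pat="$1" '{
        n = patsplit($0, arr, pat, seps)
        pos = length(seps[0]) + 1          # seps[0] is the text before arr[1]
        for (i = 1; i <= n; i++) {
            printf "%s at %d\n", arr[i], pos
            # skip this match and the separator that follows it
            pos += length(arr[i]) + length(seps[i])
        }
    }'
}

if command -v gawk >/dev/null 2>&1; then
    echo 'a12b345c6' | match_positions '[0-9]+'
fi
```

For the sample input this should print "12 at 2", "345 at 5" and "6 at 9".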

PDF Text Processing – How to Get Page Numbers of a Pattern in PDF

It's a bit hacky, but since you are already using a Perl-compatible RE, you could use the \K ("keep left") escape to match everything in your expression (and anything else up to the next line end) but exclude it from the output:

pdfgrep -Pn '(?s)image\s+?not\s+?available.*?$\K'  main_text.pdf

The output will still include the : separator however.
