In POSIX awk, is there a builtin function which can achieve either of the two objectives?
No. You can achieve the same effect, but not with a single builtin function.
Does the match builtin function only find the leftmost and longest match?
Yes. Regular expressions in POSIX awk (and GNU awk) are always greedy (i.e. longest match always wins).
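A quick way to see the leftmost-longest rule is to inspect RSTART and RLENGTH after a call to match(). This is only a minimal sketch; the sample string is made up for illustration.

```shell
# match() sets RSTART (1-based start) and RLENGTH (length of the match).
# "xaaay aaaa" is an invented sample: the longer run "aaaa" comes later,
# but the leftmost position wins first, then the longest match there.
out=$(awk 'BEGIN {
    match("xaaay aaaa", /a+/)
    print RSTART, RLENGTH
}')
echo "$out"    # 2 3
```

The match reported is "aaa" at position 2, not the longer "aaaa" further right.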
To achieve the first objective, is it a correct way to repeatedly apply match to the suffix of the target string created by finding each match and removing the match and the prefix before it from the target string?
Yes, but if you want 100% compatibility with gsub() the details are pretty tricky.
Is https://gist.github.com/mllamazing/a40946fcf8211a503bed a correct implementation?
Mostly, if you remove the gsub line. The devil is in the details: the code will loop if regex is an empty string. Classic awk didn't allow empty regexps, but IIRC nawk did. To fix that you could do something like this:
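    function FindAllMatches(str, regex, match_arr) {
        ftotal = 0;                 # global: number of matches found
        ini = RSTART;               # save the match state so the caller's values survive
        leng = RLENGTH;
        delete match_arr;
        while (str != "" && match(str, regex) > 0) {
            match_arr[++ftotal] = substr(str, RSTART, RLENGTH)
            # skip past the match; advance by 1 on an empty match to avoid looping
            str = substr(str, RSTART + (RLENGTH ? RLENGTH : 1))
        }
        RSTART = ini;
        RLENGTH = leng;
    }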
That's not 100% compatible with gsub(), however, because
    $ echo 123 | awk '{ gsub("", "-") } 1'
    -1-2-3-
while the function above finds only 3 matches (namely, it misses the match at the end).
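For ordinary, non-empty patterns the loop behaves as expected. The sketch below exercises the same idea on an invented input; find_all is a hypothetical variant of the function above that returns the count instead of setting a global, and clears the array with split("", arr).

```shell
out=$(awk '
# find_all is a hypothetical variant of FindAllMatches: it returns the
# match count and keeps its working variables local.
function find_all(str, regex, arr,    n, rs, rl) {
    rs = RSTART; rl = RLENGTH          # save the match state of the caller
    split("", arr)                     # clear the result array
    n = 0
    while (str != "" && match(str, regex) > 0) {
        arr[++n] = substr(str, RSTART, RLENGTH)
        # step past the match; advance by 1 on an empty match
        str = substr(str, RSTART + (RLENGTH ? RLENGTH : 1))
    }
    RSTART = rs; RLENGTH = rl
    return n
}
BEGIN {
    n = find_all("a1b22c333", "[0-9]+", m)   # invented sample input
    for (i = 1; i <= n; i++) printf "%s%s", m[i], (i < n ? " " : "\n")
}')
echo "$out"    # 1 22 333
```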
You could try this instead:
    function FindAllMatches(str, regex, match_arr) {
        ftotal = 0;                 # global: number of matches found
        ini = RSTART;               # save the match state so the caller's values survive
        leng = RLENGTH;
        delete match_arr;
        while (match(str, regex) > 0) {
            match_arr[++ftotal] = substr(str, RSTART, RLENGTH)
            # record a possible empty match at the very end, then stop
            if (str == "") break
            str = substr(str, RSTART + (RLENGTH ? RLENGTH : 1))
        }
        RSTART = ini;
        RLENGTH = leng;
    }
This fixes the problem above, but it breaks other cases: if str = "123" and regex = "[1-9]*", the function finds two occurrences, 123 and the empty string at the end, while gsub() finds only one, 123.
There may be other similar differences that I can't be bothered to hunt down right now.
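Since gsub() returns the number of substitutions it made, one way to hunt for such differences is to compare that count with what a match() loop finds. This is a rough sketch only: count_matches is a hypothetical helper and the input is invented. Replacing with "&" puts each match back, so the string is left unchanged.

```shell
out=$(awk '
# count_matches is a hypothetical helper: a match() loop in the style of
# the functions above, reduced to counting.
function count_matches(str, regex,    n) {
    n = 0
    while (str != "" && match(str, regex) > 0) {
        n++
        str = substr(str, RSTART + (RLENGTH ? RLENGTH : 1))
    }
    return n
}
BEGIN {
    s = "a1b22c333"; re = "[0-9]+"     # invented sample input
    loop = count_matches(s, re)
    ref = gsub(re, "&", s)             # "&" puts each match back unchanged
    print loop, ref, (loop == ref ? "same" : "DIFFERENT")
}')
echo "$out"    # 3 3 same
```

Swapping in patterns that can match the empty string is where the two counts are likely to diverge.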
In Gawk, does array after a call patsplit(string, array, fieldpat, seps) store the matches as required in the second objective?
Mostly yes. However, corner cases related to regexps can be unexpectedly subtle. There may be some differences, as above.
Can the locations of the matches be found from array and seps, given that seps[i] is the separator string between array[i] and array[i+1]?
Yes.
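A minimal sketch of that bookkeeping, assuming gawk is installed (patsplit() is a gawk extension) and using an invented input string: any text before the first match lands in seps[0], so the first match starts at length(seps[0]) + 1, and each later start follows by adding the lengths of the previous match and separator.

```shell
out=$(gawk 'BEGIN {
    s = "a1b22c333d"                       # invented sample input
    n = patsplit(s, arr, /[0-9]+/, seps)
    pos = length(seps[0]) + 1              # seps[0] holds any leading separator
    for (i = 1; i <= n; i++) {
        # print start position, length, and text of each match
        printf "%d %d %s\n", pos, length(arr[i]), arr[i]
        pos += length(arr[i]) + length(seps[i])
    }
}')
echo "$out"
# 2 1 1
# 4 2 22
# 7 3 333
```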
It's a bit hacky, but since you are already using a Perl-compatible RE, you could use the \K ("keep left") modifier to match everything in your expression (and anything else up to the next line end) but exclude it from the output:
pdfgrep -Pn '(?s)image\s+?not\s+?available.*?$\K' main_text.pdf
The output will still include the : separator, however.
Best Answer
I don't think it's about the regex, but about how the double-quoted string is handled. C-style escapes (like \n) are interpreted in awk strings, and gawk and mawk treat invalid escapes differently: mawk seems to leave the backslash as-is, while gawk removes it (and complains, at least in my version). So, the actual regexes used are different: in gawk the regex is .pdf, which of course matches /pdf, since the dot matches any single character, while in mawk your regex is \.pdf, where the dot is escaped and matched literally.
GNU awk's manual explicitly mentions it's not portable to use a backslash before a character with no defined backslash-escape sequence (see the box "Backslash Before Regular Characters").
I assume you want the dot to be escaped in the regex, so the safe ways are either $NF ~ "\\.pdf", or $NF ~ /\.pdf/ (since with the regex literal /.../, the escapes aren't "double processed"). The POSIX text also notes the double processing of the escapes.
So, the string form with the doubled backslash works in both gawk and mawk, as does the regex literal.
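For instance, both forms can be checked against a small invented input in which only the first line contains a literal ".pdf". This is just a sketch with made-up file names; plain awk is used here, and the same commands should behave identically under gawk and mawk.

```shell
# Invented sample input: only the first line contains a literal ".pdf".
sample=$(printf '%s\n' report.pdf report/pdf reportxpdf)

# Dynamic regex string: "\\.pdf" is the string \.pdf, i.e. the regex \.pdf
str_out=$(printf '%s\n' "$sample" | awk '$NF ~ "\\.pdf"')

# Regex literal: escapes are processed once, so a single backslash suffices
lit_out=$(printf '%s\n' "$sample" | awk '$NF ~ /\.pdf/')

echo "$str_out"    # report.pdf
echo "$lit_out"    # report.pdf
```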