In POSIX awk and Gawk respectively, how can we find all the matches to a regular expression in a string?
More specifically, how can we find all the matches that the `gsub` builtin function would substitute, in terms of either of the following two objectives:
- find the position and length of each match in the target string, and
- find the matches as substrings of the target string only.
Achieving the first objective implies achieving the second objective.
- In POSIX awk,
  - Is there a builtin function which can achieve either of the two objectives?
  - Does the `match` builtin function find only the leftmost and longest match?
  - To achieve the first objective, is it correct to repeatedly apply `match` to the suffix of the target string, where the suffix is created by finding each match and removing the match and the prefix before it from the target string?
  - Is https://gist.github.com/mllamazing/a40946fcf8211a503bed a correct implementation?
- In Gawk,
  - Does `array` after a call to `patsplit(string, array, fieldpat, seps)` store the matches as required in the second objective?
  - Can the locations of the matches be found from `array` and `seps`, given that `seps[i]` is the separator string between `array[i]` and `array[i+1]`?
Thanks.
Best Answer
No. You can achieve the same effect, but not with a single builtin function.
Yes. Regular expressions in POSIX `awk` (and GNU `awk`) are always greedy (i.e. the longest match always wins).
Yes, but if you want 100% compatibility with `gsub()` the details are pretty tricky.
Mostly, if you remove the gsub line. The devil is in the details: the code will loop forever if `regex` is an empty string. Classic `awk` didn't allow empty regexps, but IIRC `nawk` did. To fix that you could do something like this:
That's not 100% compatible with `gsub()`, however: in a case where `gsub()` also matches at the end of the string, the function above finds only 3 matches (namely, it misses the match at the end).
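
To make this kind of mismatch concrete, here is a minimal sketch. It is not the code from the answer: `compare_counts` and `count_matches` are hypothetical names, and the loop simply steps one character past an empty match, which is only one possible way to avoid the infinite loop. It compares the loop's count with the count returned by `gsub()` on the same input.

```shell
compare_counts() {
    awk -v s="$1" -v re="$2" '
    # Hypothetical loop: count matches, stepping one character past an
    # empty match so that the loop always makes progress.
    function count_matches(str, regex,    n) {
        n = 0
        while (str != "" && match(str, regex)) {
            n++
            str = substr(str, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
        }
        return n
    }
    BEGIN {
        print "loop: " count_matches(s, re)
        t = s
        n = gsub(re, "-", t)     # gsub returns the number of substitutions
        print "gsub: " n " -> " t
    }'
}

compare_counts "123" "[a-z]*"
```

With the input `123` and the regex `[a-z]*`, every match is empty. The loop stops once the string is used up and reports 3 matches, while `gsub()` also substitutes at the end of the string and reports 4 (the result is `-1-2-3-`).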
You could try this instead:
This fixes the problem above, but it breaks other cases: if `str = "123"` and `regex = "[1-9]*"`, the function finds two occurrences, `123` and the empty string at the end, while `gsub()` finds only one, `123`.
There may be other similar differences that I can't be bothered to hunt down right now.
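
One way to handle both cases discussed above is sketched below, with the same caveat that other differences may remain. It is an illustration, not the code from the answer or the gist: `find_all_matches`, `find_all`, `pos` and `len` are hypothetical names. It assumes the rule that an empty match is not counted when it directly follows a non-empty match, which is how gawk's `gsub()` behaves and which is consistent with the `123` / `[1-9]*` example. Because `match` is applied to suffixes, anchors such as `^` will not behave as they do in `gsub()`.

```shell
find_all_matches() {
    awk -v s="$1" -v re="$2" '
    # Hypothetical helper: stores the 1-based start and the length of each
    # match in pos[] and len[], and returns the number of matches.
    function find_all(str, regex, pos, len,    n, off, after_nonempty) {
        n = 0                # matches recorded so far
        off = 0              # characters of the original string already consumed
        after_nonempty = 0   # 1 if the previous match was non-empty
        while (str != "" && match(str, regex)) {
            if (RLENGTH > 0) {
                n++; pos[n] = off + RSTART; len[n] = RLENGTH
                off += RSTART + RLENGTH - 1
                str = substr(str, RSTART + RLENGTH)
                after_nonempty = 1
            } else {
                # Assumed rule: skip an empty match that directly follows
                # a non-empty one.
                if (!(after_nonempty && RSTART == 1)) {
                    n++; pos[n] = off + RSTART; len[n] = 0
                }
                if (RSTART > length(str))
                    return n     # the empty match was at the very end
                off += RSTART    # step over one character
                str = substr(str, RSTART + 1)
                after_nonempty = 0
            }
        }
        # A possible empty match at the end of the string.
        if (str == "" && !after_nonempty && "" ~ regex) {
            n++; pos[n] = off + 1; len[n] = 0
        }
        return n
    }
    BEGIN {
        n = find_all(s, re, pos, len)
        for (i = 1; i <= n; i++)
            printf "%d %d [%s]\n", pos[i], len[i], substr(s, pos[i], len[i])
    }'
}

find_all_matches "abc" "b*"
```

Each output line is `position length [match]`. For `abc` and `b*` the sketch reports an empty match at position 1, `b` at position 2, and an empty match at position 4 (just past the end). For `123` and `[1-9]*` it reports the single match `123`.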
Mostly yes. However, corner cases related to regexps can be unexpectedly subtle. There may be some differences, as above.
Yes.
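
As a rough illustration of recovering positions from `array` and `seps`, the sketch below walks both arrays and keeps a running offset. It requires gawk 4.0 or later (where `patsplit` is available) and relies on `seps[0]` holding any text before the first match. The names `patsplit_positions`, `parts` and `pos` are hypothetical, and the same caveat about corner cases applies.

```shell
patsplit_positions() {
    gawk -v s="$1" -v re="$2" '
    BEGIN {
        n = patsplit(s, parts, re, seps)
        pos = 1 + length(seps[0])     # seps[0]: text before the first match
        for (i = 1; i <= n; i++) {
            printf "%d %d [%s]\n", pos, length(parts[i]), parts[i]
            # skip the match itself and the separator that follows it
            pos += length(parts[i]) + length(seps[i])
        }
    }'
}

patsplit_positions "a1b22c" "[0-9]+"
```

For `a1b22c` and `[0-9]+` this prints `2 1 [1]` and `4 2 [22]`, i.e. the start position, length and text of each match.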