How to find all matches to a regular expression in a string

awkgawkregular expressiontext processing

In POSIX awk and Gawk respectively, how can we find all the matches to a regular expression in a string?

More specifically, find all the matches that are substituted by gsub builtin function, in terms of either of the following two objectives:

find the position and length of each match in the target string, and
find the matches as substrings of the target string only.

Achieving the first objective implies achieving the second objective.

In POSIX awk,

Is there a builtin function which can achieve either of the two
objectives?

Does the match builtin function only find the leftmost and longest
match?

To achieve the first objective, is it a correct way to repeatedly
apply match to the suffix of the target string created by finding
each match and removing the match and the prefix before it from
the target string? Is
https://gist.github.com/mllamazing/a40946fcf8211a503bed a correct
implementation?
In Gawk,

does array after a call patsplit(string, array, fieldpat, seps)
store the matches as required in the second objective? Can the
locations of the match location be found from array and seps,
based on that seps[i] is the separator string between array[i]
and array[i+1]?

Thanks.

Best Answer

In POSIX awk,
Is there a builtin function which can achieve either of the two objectives?

No. You can achieve the same effect, but not with a single builtin function.

Does the match builtin function only find the leftmost and longest match?

Yes. Regular expressions in POSIX awk (and GNU awk) are always greedy (i.e. longest match always wins).

To achieve the first objective, is it a correct way to repeatedly apply match to the suffix of the target string created by finding each match and removing the match and the prefix before it from the target string?

Yes, but if you want 100% compatibility with gsub() the details are pretty tricky.

Is https://gist.github.com/mllamazing/a40946fcf8211a503bed a correct implementation?

Mostly, if you remove the gsub line. The devil is in the details: the code will loop if regex is an empty string. Classic awk didn't allow empty regexps, but IIRC nawk did. To fix that you could do something like this:

function FindAllMatches(str, regex, match_arr) {

    ftotal = 0;
    ini = RSTART;
    leng = RLENGTH;

    delete match_arr;

    while (str != "" && match(str, regex) > 0) {
        match_arr[++ftotal] = substr(str, RSTART, RLENGTH)
        str = substr(str, RSTART + (RLENGTH ? RLENGTH : 1))
    }

    RSTART = ini;
    RLENGTH = leng;
}

That's not 100% compatible to gsub() however, because

$ echo 123 | awk '{ gsub("", "-") } 1'
-1-2-3-

while the function above finds only 3 matches (namely, it misses the match at the end).

You could try this instead:

function FindAllMatches(str, regex, match_arr) {

    ftotal = 0;
    ini = RSTART;
    leng = RLENGTH;

    delete match_arr;

    while (match(str, regex) > 0) {
        match_arr[++ftotal] = substr(str, RSTART, RLENGTH)
        if (str == "") break
        str = substr(str, RSTART + (RLENGTH ? RLENGTH : 1))
    }

    RSTART = ini;
    RLENGTH = leng;
}

This fixes the problem above, but it breaks other cases: if str = "123" and regex = "[1-9]*" the function finds two occurrences, 123 and the empty string at the end, while gsub() finds only one, 123.

There may be other similar differences, that I can't be bothered to hunt right now.

In Gawk,

does array after a call patsplit(string, array, fieldpat, seps) store the matches as required in the second objective?

Mostly yes. However, corner cases related to regexps can be unexpectedly subtle. There may be some differences, as above.

Can the locations of the match location be found from array and seps, based on that seps[i] is the separator string between array[i] and array[i+1]?

Yes.

Related Solutions

POSIX Array – Portable Check for Array in Awk

I also don't think it's possible.

But I'll add that with busybox awk, variables can be both arrays and scalar. There it's OK to do:

a = "foo"; a["foo"] = "bar"

When a variable has been used as an array though, length() returns the number of elements in the array, even if it also has been defined as a scalar (though you can use length(var "") to get the length of the scalar), except when the variable has been passed as an argument to a function and assigned as a scalar there (could be considered as a bug):

$ busybox awk 'BEGIN{a[1] = 1; a = "foo"; print length(a), length(a"")}'
1 3
$ busybox awk 'function f(x) {x = "xxx"; print x[1], length(x)}
               BEGIN{a[1]=1; x = "yyy"; print a[1], length(a); f(a)}'
1 1
1 3

Too bad as otherwise it would have been easy to define a isarray() function there. We can still tell if a variable is an array with at least one element with

function isnonemptyarray(x) {
  return length(x) > 0 && length(x "") == 0
}

(assuming the variable hasn't been defined both as an array and scalar)

In anycase, that's busybox awk specific. length() can't be used on arrays portably. One can define a portable array_length() function with:

function array_length(a, tmp1, tmp2) {
  tmp1 = 0
  for (tmp2 in a) tmp1++
  return tmp1
}

But that can't be used portably on non-array variables.

Where to get the new string after running `sub` in awk

From the GNU awk manual 9.1.3 String-Manipulation Functions:

... the third argument to sub() must be a variable, field, or array element. Some versions of awk allow the third argument to be an expression that is not an lvalue. In such a case, sub() still searches for the pattern and returns zero or one, but the result of the substitution (if any) is thrown away because there is no place to put it. Such versions of awk accept expressions like the following:

sub(/USA/, "United States", "the USA and Canada")

For historical compatibility, gawk accepts such erroneous code. However, using any other nonchangeable object as the third parameter causes a fatal error and your program will not run.

So, the answer is to use a variable:

awk 'BEGIN{t = "banana"; sub(/ana/,"anda",t); print t}'
bandana

Best Answer

Related Solutions

POSIX Array – Portable Check for Array in Awk

Where to get the new string after running `sub` in awk

Related Question