How to find all matches to a regular expression in a string

awkgawkregular expressiontext processing

In POSIX awk and Gawk respectively, how can we find all the matches to a regular expression in a string?

More specifically, find all the matches that are substituted by gsub builtin function, in terms of either of the following two objectives:

  • find the position and length of each match in the target string, and

  • find the matches as substrings of the target string only.

Achieving the first objective implies achieving the second objective.

  1. In POSIX awk,

    Is there a builtin function which can achieve either of the two
    objectives?

    Does the match builtin function only find the leftmost and longest
    match?

    To achieve the first objective, is it a correct way to repeatedly
    apply match to the suffix of the target string created by finding
    each match and removing the match and the prefix before it from
    the target string? Is
    https://gist.github.com/mllamazing/a40946fcf8211a503bed a correct
    implementation?

  2. In Gawk,

    does array after a call patsplit(string, array, fieldpat, seps)
    store the matches as required in the second objective? Can the
    locations of the match location be found from array and seps,
    based on that seps[i] is the separator string between array[i]
    and array[i+1]?

Thanks.

Best Answer

  1. In POSIX awk,
    Is there a builtin function which can achieve either of the two objectives?

No. You can achieve the same effect, but not with a single builtin function.

Does the match builtin function only find the leftmost and longest match?

Yes. Regular expressions in POSIX awk (and GNU awk) are always greedy (i.e. longest match always wins).

To achieve the first objective, is it a correct way to repeatedly apply match to the suffix of the target string created by finding each match and removing the match and the prefix before it from the target string?

Yes, but if you want 100% compatibility with gsub() the details are pretty tricky.

Is https://gist.github.com/mllamazing/a40946fcf8211a503bed a correct implementation?

Mostly, if you remove the gsub line. The devil is in the details: the code will loop if regex is an empty string. Classic awk didn't allow empty regexps, but IIRC nawk did. To fix that you could do something like this:

function FindAllMatches(str, regex, match_arr) {

    ftotal = 0;
    ini = RSTART;
    leng = RLENGTH;

    delete match_arr;

    while (str != "" && match(str, regex) > 0) {
        match_arr[++ftotal] = substr(str, RSTART, RLENGTH)
        str = substr(str, RSTART + (RLENGTH ? RLENGTH : 1))
    }

    RSTART = ini;
    RLENGTH = leng;
}

That's not 100% compatible to gsub() however, because

$ echo 123 | awk '{ gsub("", "-") } 1'
-1-2-3-

while the function above finds only 3 matches (namely, it misses the match at the end).

You could try this instead:

function FindAllMatches(str, regex, match_arr) {

    ftotal = 0;
    ini = RSTART;
    leng = RLENGTH;

    delete match_arr;

    while (match(str, regex) > 0) {
        match_arr[++ftotal] = substr(str, RSTART, RLENGTH)
        if (str == "") break
        str = substr(str, RSTART + (RLENGTH ? RLENGTH : 1))
    }

    RSTART = ini;
    RLENGTH = leng;
}

This fixes the problem above, but it breaks other cases: if str = "123" and regex = "[1-9]*" the function finds two occurrences, 123 and the empty string at the end, while gsub() finds only one, 123.

There may be other similar differences, that I can't be bothered to hunt right now.

  1. In Gawk,

    does array after a call patsplit(string, array, fieldpat, seps) store the matches as required in the second objective?

Mostly yes. However, corner cases related to regexps can be unexpectedly subtle. There may be some differences, as above.

Can the locations of the match location be found from array and seps, based on that seps[i] is the separator string between array[i] and array[i+1]?

Yes.

Related Question