Bash – Lazy regex using Bash

bash, regular expression

I'm trying to match just the text contained within the HTML tags using Bash's built-in regex function:

string='<span class="circle"> </span>foo</span></span>'
regex='<span class="circle"> </span>(.+?)</span>'
[[ $string =~ $regex ]]
echo "${BASH_REMATCH[1]}"

But the match keeps capturing foo</span>.

The internet is so crowded with examples of sed and grep that I haven't found much documentation on Bash's own regex.

Best Answer

There is a reason why the internet is packed with alternative approaches. I can't really think of any situation where you would be forced to use bash for this. Why not use one of the tools designed for the job?

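Anyway, as far as I know there is no way of doing non-greedy matches using the =~ operator. That's because bash does not bring a regex engine of its own for this; it uses your system's C library one, as described in man 3 regex. That library implements POSIX extended regular expressions, which have no lazy quantifiers such as +?. This is explained in man bash: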

   An additional binary operator, =~, is available, with the same precedence
   as == and !=. When it is used, the string to the right of the operator is
   considered an extended regular expression and matched accordingly (as in
   regex(3)).

You can, however, do more or less what you want (bearing in mind that this is really not a good way of parsing HTML files) with a slightly different regex:

string='<span class="circle"> </span>foo</span></span>'
regex='<span class="circle"> </span>([^<]+)</span>'
[[ $string =~ $regex ]]
echo "${BASH_REMATCH[1]}"

The above will return foo as expected.
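To see the difference side by side, here is a minimal sketch. The helper name first_group is made up for illustration, and the greedy pattern uses a plain .+ rather than the question's .+? so that the behavior does not depend on how a particular C library treats the stray ?.

```shell
#!/usr/bin/env bash
# Sketch: greedy vs. negated-class capture with Bash's =~ (POSIX ERE).
string='<span class="circle"> </span>foo</span></span>'

# Hypothetical helper: print capture group 1 of regex $2 applied to string $1.
first_group() {
    local s=$1 re=$2
    [[ $s =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

greedy='<span class="circle"> </span>(.+)</span>'
bounded='<span class="circle"> </span>([^<]+)</span>'

first_group "$string" "$greedy"    # prints: foo</span>
first_group "$string" "$bounded"   # prints: foo
```

The negated class [^<]+ cannot run past the next <, so it stops where a lazy quantifier would have stopped. It only works here because the wanted text contains no < itself.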

Related Solutions

Bash Scripting – Using Regex Inside If Clause

You need to remove the quoting in the regex match.

if [[ ${str} =~ m\.m ]]; then

From the bash man page:

[...] An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise. If the regular expression is syntactically incorrect, the conditional expression's return value is 2. If the shell option nocasematch is enabled, the match is performed without regard to the case of alphabetic characters. Any part of the pattern may be quoted to force it to be matched as a string.

So with the quotes, you're doing plain old string matching instead of regex matching.

If you need spaces in the pattern, just escape them:

str="m   m"
if [[ ${str} =~ m\ +m ]]; then
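A common way to sidestep the escaping is to keep the pattern in a variable and expand it unquoted. The following is only a sketch of that idea; the function names matches_regex and matches_literal are invented for the example.

```shell
#!/usr/bin/env bash
# Sketch: the same pattern text, used as a regex and as a literal string.

# Unquoted expansion on the right of =~ : $2 is treated as an ERE.
matches_regex()   { [[ $1 =~ $2 ]]; }

# Quoted expansion on the right of =~ : $2 is matched as literal text.
matches_literal() { [[ $1 =~ "$2" ]]; }

re='m.m'
matches_regex   'mam' "$re" && echo "unquoted: mam matches"
matches_literal 'mam' "$re" || echo "quoted: mam does not match"

# Spaces need no backslashes when the pattern lives in a variable.
spaced='m +m'
matches_regex 'm   m' "$spaced" && echo "unquoted: 'm   m' matches"
```

The quotes around "$re" in the calls only protect the argument on its way into the function; what matters is whether the expansion to the right of =~ is quoted.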

Bash – How to Use Regex Capture Groups

It's a shame that you can't do global matching in bash. You can do this:

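global_rematch() {
    local s=$1 regex=$2
    # Keep matching until nothing in the remaining string matches.
    while [[ $s =~ $regex ]]; do
        echo "${BASH_REMATCH[1]}"
        # Drop everything up to and including the captured text.
        s=${s#*"${BASH_REMATCH[1]}"}
    done
}
# mystring1 and regex are the variables defined in the question.
global_rematch "$mystring1" "$regex"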

1BBBBBB
2AAAAAAA

This works by chopping everything up to and including the captured text off the string so the next part can be matched. It destroys the string, but in the function it's a local variable, so who cares.

I would actually use that function to populate an array:

$ mapfile -t matches < <( global_rematch "$mystring1" "$regex" )
$ printf "%s\n" "${matches[@]}"
1BBBBBB
2AAAAAAA
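For a self-contained run, here is a variant of the same idea. It is a sketch, not the function above: the name global_rematch_all and the sample data are made up, it strips the whole match (BASH_REMATCH[0]) instead of group 1, which seems safer when the captured text also appears earlier in the string, and it stops on an empty match so the loop cannot spin forever.

```shell
#!/usr/bin/env bash
# Sketch: print capture group 1 of every match of regex $2 in string $1.
global_rematch_all() {
    local s=$1 regex=$2
    while [[ $s =~ $regex ]]; do
        # An empty match would never shrink the string, so bail out.
        [[ -n ${BASH_REMATCH[0]} ]] || break
        printf '%s\n' "${BASH_REMATCH[1]}"
        # Drop everything up to and including the whole match.
        s=${s#*"${BASH_REMATCH[0]}"}
    done
}

# Hypothetical sample data, not taken from the original question.
sample='id=10; id=25; id=7'
mapfile -t ids < <(global_rematch_all "$sample" 'id=([0-9]+)')
printf '%s\n' "${ids[@]}"
# prints:
# 10
# 25
# 7
```

mapfile needs bash 4 or later; on older shells a while read loop over the function's output would do the same job.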
