Bash – Lazy regex using Bash

bash, regular expression

I'm trying to match just the text contained within the HTML tags using Bash's built-in regex function:

string='<span class="circle"> </span>foo</span></span>'
regex='<span class="circle"> </span>(.+?)</span>'
[[ $string =~ $regex ]]
echo "${BASH_REMATCH[1]}"

But the match keeps capturing foo</span>.

The internet is so crowded with examples of sed and grep that I haven't found much documentation on Bash's own regex.

Best Answer

There is a reason why the internet is packed with alternative approaches. I can't really think of any situation where you would be forced to use bash for this. Why not use one of the tools designed for the job?

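Anyway, as far as I know there is no way to do non-greedy matching with the =~ operator. Bash has no regex engine of its own for this: =~ hands the pattern to your system's C library regex functions, described in man 3 regex, and the POSIX extended regular expressions they implement have no lazy quantifiers such as +? or *?. This is explained in man bash: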

   An additional binary operator, =~, is available, with the same precedence
   as == and !=. When it is used, the string to the right of the operator is
   considered an extended regular expression and matched accordingly (as in
   regex(3)).
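
To make the greedy behaviour concrete, here is a minimal sketch that reuses the string from the question with a plain .+ quantifier, since extended regular expressions have no lazy form. The variable names greedy_regex and greedy are only illustrative and are not part of the original question.

```bash
#!/usr/bin/env bash
string='<span class="circle"> </span>foo</span></span>'

# Plain greedy quantifier: the group takes as much as it can while
# still leaving one trailing </span> for the rest of the pattern.
greedy_regex='<span class="circle"> </span>(.+)</span>'

if [[ $string =~ $greedy_regex ]]; then
    greedy=${BASH_REMATCH[1]}
    echo "$greedy"    # prints: foo</span>
fi
```

The group swallows the first closing tag because the overall match is extended as far as possible, which is the same result the question reports.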

You can, however, do more or less what you want (bearing in mind that this is really not a good way of parsing HTML files) with a slightly different regex:

string='<span class="circle"> </span>foo</span></span>'
regex='<span class="circle"> </span>([^<]+)</span>'
[[ $string =~ $regex ]]
echo "${BASH_REMATCH[1]}"

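The above prints foo as expected: [^<]+ cannot match a <, so the capture has to stop at the first closing tag instead of running on to the last one.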
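
If the text you want could itself contain a <, a negated character class is no longer enough. One possible workaround, sketched here under the assumption that the opening and closing markers are fixed strings, is to skip the regex and use Bash's parameter expansion, which can cut at the first occurrence of a multi-character delimiter. The names prefix, rest and text are hypothetical and chosen only for this example.

```bash
#!/usr/bin/env bash
string='<span class="circle"> </span>foo</span></span>'
prefix='<span class="circle"> </span>'

# Remove the shortest leading part that ends with the prefix,
# leaving whatever follows its first occurrence.
rest=${string#*"$prefix"}

# Remove the longest trailing part that starts with </span>,
# which cuts at the first closing tag (the "lazy" behaviour).
text=${rest%%'</span>'*}

echo "$text"    # prints: foo
```

This only emulates a lazy match for literal delimiters. It is still not a robust way to parse HTML.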
