Bash – Any non-whitespace regular expression

bash, regular expression

I'm trying to match a string against a regular expression inside an if statement in bash. Code below:

var='big'
if [[ $var =~ ^b\S+[a-z]$ ]]; then
    echo "$var"
else
    echo 'none'
fi

The match should be a string that starts with 'b', followed by one or more non-whitespace characters, and ends in a letter a-z. I can match the start and end of the string, but \S is not working to match the non-whitespace characters. Thanks in advance for the help.

Best Answer

On non-GNU systems, the following explains why \S fails:

\S is part of PCRE (Perl Compatible Regular Expressions). It is not part of BRE (Basic Regular Expressions) or ERE (Extended Regular Expressions), the flavors used in shells.

The bash operator =~ inside the double-bracket test [[ uses ERE.

The only characters with special meaning in ERE (as opposed to any normal character) are .[\()*+?{|^$. The letter S is not among them, so you need to construct the regex from more basic elements:

regex='^b[^[:space:]]+[a-z]$'

Where the bracket expression [^[:space:]] is the equivalent of the PCRE expression \S:

The default \s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space (32).

The test would be:

var='big'            regex='^b[^[:space:]]+[a-z]$'

[[ $var =~ $regex ]] && echo "$var" || echo 'none'
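To see how the portable regex behaves on a few inputs, here is a minimal sketch. The helper name check_word and the sample strings are only illustrative and not part of the original answer.

```shell
#!/usr/bin/env bash
# Portable ERE: 'b', one or more non-whitespace characters, then a-z.
regex='^b[^[:space:]]+[a-z]$'

# check_word is a hypothetical helper: print the word on a match, else 'none'.
check_word() {
  if [[ $1 =~ $regex ]]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' 'none'
  fi
}

check_word 'big'    # big
check_word 'b g'    # none (whitespace in the middle)
check_word 'bg'     # none (nothing between the b and the last letter)
check_word 'big1'   # none (does not end in a-z)
```

Keeping the pattern in a variable and expanding it unquoted on the right side of =~ avoids quoting surprises inside [[ ]].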

However, the code above will match bißß, for example, because the range [a-z] can include characters other than abcdefghijklmnopqrstuvwxyz when the selected locale is a Unicode one. To avoid this issue, use:

var='bißß'            regex='^b[^[:space:]]+[a-z]$'

( LC_ALL=C;
  [[ $var =~ $regex ]] && echo "$var" || echo 'none'
)

Please be aware that the code will only accept a character from the list abcdefghijklmnopqrstuvwxyz in the last position, but it will still match many others in the middle, e.g. bég.

Also, this use of LC_ALL=C affects the other bracket expression: [[:space:]] will match only the whitespace characters of the C locale.
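A small sketch of the C-locale test wrapped in a function may make the effect easier to try out. The name match_in_c_locale is hypothetical; the function body runs in a subshell, so the LC_ALL change does not leak into the caller.

```shell
#!/usr/bin/env bash
regex='^b[^[:space:]]+[a-z]$'

# match_in_c_locale is a hypothetical helper. The ( ... ) body is a
# subshell, so LC_ALL=C only applies to this one regex test.
match_in_c_locale() (
  LC_ALL=C
  if [[ $1 =~ $regex ]]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' 'none'
  fi
)

match_in_c_locale 'big'     # big
match_in_c_locale 'bißß'    # none: the last byte is not in a-z
```

In the C locale the regex works byte by byte, which is why the last position is limited to ASCII a-z there.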

To solve all the issues, we need to keep each regex separate:

reg1=[[:space:]]   reg2='^b.*[a-z]$'           out=none

if                 [[ $var =~ $reg1 ]]  ; then out=none
elif   ( LC_ALL=C; [[ $var =~ $reg2 ]] ); then out="$var"
fi
printf '%6.8s\t|' "$out"

Which reads as:

If the input (var) has no spaces (in the present locale), then
check that it starts with a b and ends in a-z (in the C locale).
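The two-step test can also be packaged as a function. This is only a sketch: classify is a hypothetical name, and the regexes are copied from the code above.

```shell
#!/usr/bin/env bash
reg1='[[:space:]]'    # any whitespace, judged in the caller's locale
reg2='^b.*[a-z]$'     # starts with b, ends in a-z, judged in the C locale

# classify is a hypothetical name for the two-step test.
classify() {
  local out=none
  if [[ $1 =~ $reg1 ]]; then
    out=none
  elif ( LC_ALL=C; [[ $1 =~ $reg2 ]] ); then
    out=$1
  fi
  printf '%s\n' "$out"
}

classify 'big'     # big
classify 'b g'     # none (contains whitespace)
classify 'big1'    # none (does not end in a-z)
```

Note that .* in reg2 also accepts the two-character input bg. If at least one middle character is required, as the question states, .+ would enforce that.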

Note that both tests are done on the positive ranges (as opposed to a "not"-range). The reason is that negating a couple of characters opens up a lot more possible matches. Unicode 8.0 already has 120,737 characters assigned. If a range negates 17 characters, then it accepts 120,720 other possible characters, which may include many non-printable control characters.

It would be a good idea to limit the set of characters allowed in the middle (yes, those will not be spaces, but they may be anything else).
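One way to do that is to list the allowed middle characters explicitly. The sketch below assumes (this is not from the original answer) that only ASCII lowercase letters, digits, '_' and '-' are acceptable in the middle. The name check_strict is hypothetical.

```shell
#!/usr/bin/env bash
# Assumption (not from the answer): the middle may only contain
# ASCII lowercase letters, digits, '_' and '-'.
strict='^b[a-z0-9_-]+[a-z]$'

# check_strict is a hypothetical helper; the test runs in the C locale
# so that the ranges mean exactly the ASCII characters listed.
check_strict() {
  if ( LC_ALL=C; [[ $1 =~ $strict ]] ); then
    printf '%s\n' "$1"
  else
    printf '%s\n' 'none'
  fi
}

check_strict 'big'     # big
check_strict 'b-1g'    # b-1g
check_strict 'bég'     # none (é is outside the allowed set)
```

With a positive list, whitespace and control characters are rejected automatically, so no separate [[:space:]] test is needed in this variant.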

Related Solutions

Grep caret appears to have no effect

To match whitespace, you have to use [:space:] inside another pair of brackets, which looks like [[:space:]]. You probably meant grep -E '^[[:space:]]*h'

To explain why your current one fails:

As it stands, [:space:]*h is a bracket expression matching any of the characters :, s, p, a, c, and e, repeated any number of times (including 0), followed by h. This matches your string just fine, but if you run grep -o, you'll find that you've only matched the h, not the space.

If you add a caret to the beginning, either one of those letters or h must be at the beginning of the line to match, but none are, so it does not match.
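A quick way to check the corrected pattern is shown below. The sample lines and the helper name count_h_lines are made up for illustration.

```shell
#!/usr/bin/env bash
# count_h_lines is a hypothetical helper: count the input lines that
# start with optional whitespace followed by an h.
count_h_lines() {
  grep -cE '^[[:space:]]*h'
}

printf '%s\n' '  hello' 'world' 'hi' | count_h_lines    # 2
```

Only '  hello' and 'hi' are counted. 'world' contains no h at the start of the line, so the anchored pattern skips it.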

AWK Regular Expressions – Reduce Greediness

If you want to select @ and up to the first , after that, you need to specify it as @[^,]*,

That is @ followed by any number (*) of non-commas ([^,]) followed by a comma (,).
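As a small sketch of that pattern in use, called from the shell (the sample input and the helper name strip_first_field are only illustrative):

```shell
#!/usr/bin/env bash
# strip_first_field is a hypothetical helper: delete from the first @
# up to and including the first comma after it.
strip_first_field() {
  awk '{ sub(/@[^,]*,/, ""); print }'
}

printf '%s\n' 'a@b,c,d' | strip_first_field    # ac,d
```

A greedy /@.*,/ would have removed everything up to the last comma and left only ad.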

That approach works as the equivalent of @.*?,, but not for things like @.*?string, that is where what's after is more than a single character. Negating a character is easy, but negating strings in regexps is a lot more difficult.

A different approach is to pre-process your input to replace or prepend the string with a character that otherwise doesn't occur in your input:

gsub(/string/, "\1&") # pre-process
gsub(/@[^\1]*\1string/, "")
gsub(/\1/, "") # revert the pre-processing
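The three gsub calls above can be tried as a complete awk program run from the shell. This is a sketch that assumes the input never contains the \1 byte; the wrapper name strip_up_to_string and the sample input are hypothetical.

```shell
#!/usr/bin/env bash
# strip_up_to_string is a hypothetical wrapper around the three steps.
# Assumption: the input never contains the \1 (0x01) byte.
strip_up_to_string() {
  awk '{
    gsub(/string/, "\1&")          # put a \1 marker before each "string"
    gsub(/@[^\1]*\1string/, "")    # delete from @ to the first marked "string"
    gsub(/\1/, "")                 # remove the markers that are left
    print
  }'
}

printf '%s\n' 'x@foo string bar string' | strip_up_to_string    # x bar string
```

Only the text from @ through the first "string" is removed, which is the non-greedy behaviour that @.*?string would give in PCRE.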

If you can't guarantee that the input won't contain your replacement character (\1 above), one approach is to use an escaping mechanism:

gsub(/\1/, "\1\3") # use \1 as the escape character and escape itself as \1\3
                   # in case it's present in the input
gsub(/\2/, "\1\4") # use \2 as our marker character and escape it
                   # as \1\4 in case it's present in the input
gsub(/string/, "\2&") # mark the "string" occurrences

gsub(/@[^\2]*\2string/, "")

# then roll back the marking and escaping
gsub(/\2/, "")
gsub(/\1\4/, "\2")
gsub(/\1\3/, "\1")

That works for fixed strings but not for arbitrary regexps, such as the equivalent of @.*?foo.bar.
