To find a space, you have to use [:space:] inside another pair of brackets, which will look like [[:space:]]. You probably meant to write grep -E '^[[:space:]]*h'
To explain why your current one fails: as it stands, [:space:]*h includes a character class looking for any of the characters :, s, p, a, c, and e, occurring any number of times (including 0), followed by h. This matches your string just fine, but if you run grep -o, you'll find that you've only matched the h, not the space.
If you add a caret to the beginning, either one of those characters or h must be at the beginning of the string to match, but none are, so it does not match.
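Both behaviours can be checked from a shell. The sketch below is only illustrative: the sample line is an assumption, and the bare class is spelled out as [aceps:] because some grep builds (GNU grep among them) reject a literal [:space:] with a diagnostic instead of treating it as a list of characters.

```shell
# Assumed sample input: leading spaces, then a word starting with "h".
line='  hello'

# Corrected pattern: the POSIX class sits inside a bracket expression.
printf '%s\n' "$line" | grep -E '^[[:space:]]*h'

# The same set of characters that a bare [:space:] describes, written explicitly.
# -o prints only the matched part, which is the "h" without the spaces.
printf '%s\n' "$line" | grep -oE '[aceps:]*h'

# Anchored, the pattern no longer matches: the line starts with a space,
# which is neither in the set nor an "h", so grep exits with status 1.
printf '%s\n' "$line" | grep -qE '^[aceps:]*h' || echo 'no match'
```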
If you want to select @ and everything up to the first , after that, you need to specify it as @[^,]*,
That is, @ followed by any number (*) of non-commas ([^,]) followed by a comma (,).
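As a quick check of that pattern, here is a minimal sketch that assumes awk is the tool in use (as the gsub calls further down suggest). The helper name and the sample text are made up for illustration.

```shell
# Hypothetical helper: delete each "@" together with everything up to and
# including the next comma.
strip_at_to_comma() {
  awk '{ gsub(/@[^,]*,/, ""); print }'
}

# Only the span from "@" to the first comma is removed; the later comma stays.
printf '%s\n' 'keep @drop this,and keep, the rest' | strip_at_to_comma
# prints: keep and keep, the rest
```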
That approach works as the equivalent of @.*?, but not for things like @.*?string, that is, where what comes after is more than a single character. Negating a character is easy, but negating strings in regexps is a lot more difficult.
A different approach is to pre-process your input to replace or prepend the string with a character that otherwise doesn't occur in your input:
gsub(/string/, "\1&") # pre-process
gsub(/@[^\1]*\1string/, "")
gsub(/\1/, "") # revert the pre-processing
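Wrapped in a complete awk program, the three calls above might be used as follows. This is a sketch under the same assumption as the text (the input never contains the byte \1). The helper name and the sample line are hypothetical.

```shell
# Hypothetical helper: remove each "@" up to and including the first
# following "string", using \1 (byte 0x01) as a marker.
strip_at_to_string() {
  awk '{
    gsub(/string/, "\1&")            # put the marker in front of every "string"
    gsub(/@[^\1]*\1string/, "")      # "@", then non-markers, then a marked "string"
    gsub(/\1/, "")                   # remove the markers that are left
    print
  }'
}

# The match stops at the first "string" after "@", not at the last one.
printf '%s\n' 'x string [@a b string] c string' | strip_at_to_string
# prints: x string [] c string
```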
If you can't guarantee that the input won't contain your replacement character (\1 above), one approach is to use an escaping mechanism:
gsub(/\1/, "\1\3") # use \1 as the escape character and escape itself as \1\3
# in case it's present in the input
gsub(/\2/, "\1\4") # use \2 as our marker character and escape it
# as \1\4 in case it's present in the input
gsub(/string/, "\2&") # mark the "string" occurrences
gsub(/@[^\2]*\2string/, "")
# then roll back the marking and escaping
gsub(/\2/, "")
gsub(/\1\4/, "\2")
gsub(/\1\3/, "\1")
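To see the round trip in practice, the steps can be placed in one awk program and fed a line that already contains the marker byte. This is a sketch only: the helper name and the test input are assumptions.

```shell
# Hypothetical helper wrapping the escaping steps in a single awk program.
strip_at_to_string_safe() {
  awk '{
    gsub(/\1/, "\1\3")               # escape the escape character itself
    gsub(/\2/, "\1\4")               # escape any marker already in the input
    gsub(/string/, "\2&")            # mark the "string" occurrences
    gsub(/@[^\2]*\2string/, "")
    gsub(/\2/, "")                   # roll back the marking ...
    gsub(/\1\4/, "\2")               # ... and then the escaping
    gsub(/\1\3/, "\1")
    print
  }'
}

# The input deliberately contains a literal \2 byte (octal 002);
# od -c makes it visible in the result.
printf 'p\002q [@a string] r\n' | strip_at_to_string_safe | od -c
```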
That works for fixed strings but not for arbitrary regexps, such as the equivalent of @.*?foo.bar.
On non-GNU systems, the following explains why \S fails:
\S is part of PCRE (Perl Compatible Regular Expressions). It is not part of the BRE (Basic Regular Expressions) or the ERE (Extended Regular Expressions) used in shells.
The bash operator =~ inside the double-bracket test [[ uses ERE.
The only characters with special meaning in ERE (as opposed to any normal character) are .[\()*+?{|^$. S is not among them. You need to construct the regex from more basic elements: the bracket expression [^[:space:]] is the equivalent of the PCRE expression \S.
In PCRE, the default \s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space (32).
The test would be:
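A minimal sketch of such a test in bash, assuming a pattern that requires a leading b, no whitespace, and a final character in a-z. The exact regex, the helper name, and the sample strings are assumptions made for illustration.

```shell
# Hypothetical helper: succeeds when the string starts with "b", contains no
# whitespace, and ends in a character from [a-z].
starts_b_no_space() {
  local re='^b[^[:space:]]*[a-z]$'   # [^[:space:]] stands in for the PCRE \S
  [[ $1 =~ $re ]]
}

for s in 'bcd' 'b cd' 'acd'; do
  if starts_b_no_space "$s"; then
    echo "'$s' matches"
  else
    echo "'$s' does not match"
  fi
done
```

Keeping the regex in a variable avoids quoting problems on the right-hand side of =~.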
However, the code above will match bißß, for example, because the range [a-z] includes characters other than abcdefghijklmnopqrstuvwxyz if the selected locale is a Unicode one. To avoid this issue, run the test with LC_ALL=C.
Please be aware that the code will then match only characters in the list abcdefghijklmnopqrstuvwxyz in the last character position, but will still match many others in the middle, e.g. bég.
Still, this use of LC_ALL=C will affect the other regex range: [[:space:]] will match only the spaces of the C locale.
To solve all the issues, we need to keep each regex separate:
Which reads as: the string starts with b and ends in a-z (in the C locale).
Note that both tests are done on positive ranges (as opposed to a "not"-range). The reason is that negating a couple of characters opens up a lot more possible matches. Unicode 8.0 has 120,737 characters already assigned. If a range negates 17 characters, then it accepts 120,720 other possible characters, which may include many non-printable control characters.
It would be a good idea to limit the range of characters allowed in the middle (yes, they will not be spaces, but they may be anything else).
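The idea of keeping the two regexes separate can be sketched in bash as below. The helper name and both patterns are assumptions. The whitespace test runs in the current locale, while the range test runs under LC_ALL=C inside a subshell.

```shell
# Hypothetical helper: two positive tests, each evaluated in its own locale.
valid_word() {
  local space_re='[[:space:]]'   # whitespace, as defined by the current locale
  local word_re='^b.*[a-z]$'     # starts with b, ends in a-z
  # Reject anything that contains whitespace.
  if [[ $1 =~ $space_re ]]; then
    return 1
  fi
  # Run the range test in a subshell so that LC_ALL=C does not leak out.
  ( LC_ALL=C; [[ $1 =~ $word_re ]] )
}

valid_word 'bcd'  && echo 'bcd accepted'
valid_word 'b cd' || echo 'b cd rejected'
```

The middle part is still .* here, so replacing it with an explicit bracket expression would be the place to restrict which characters are allowed.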