$ printf 'asf .test. afd\nasaf foo-test asfdads\n'
asf .test. afd
asaf foo-test asfdads
$ printf 'asf .test. afd\nasaf foo-test asfdads\n' | grep -w test
asf .test. afd
asaf foo-test asfdads
Question: How can I match only the "foo-test" line? To be more precise, how can I tell "-w" to use "-" as a separator, but not "."?
Or in other words, can I tell grep that "." is among the characters that make up words, and thus that there's no word boundary in between "." and "test"?
Or are there other solutions than grep?
Best Answer
In versions prior to 2.19, GNU grep's -w would only consider single-byte alphanumeric characters and underscore as word constituents (so in UTF-8 locales, only the 26+26+10+1 ASCII letters, digits and underscore). So for instance, echo Stéphane | grep -w St would match. That was fixed in 2.19.
You could however implement the logic by hand:
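A minimal sketch of such a hand-written pattern, assuming POSIX extended regular expressions (grep -E) with "." added to the set of word constituents. The sample input comes from the question, and the variable names input and result are only illustrative:

```shell
# Sample input from the question.
input='asf .test. afd
asaf foo-test asfdads'

# "test" must be preceded by the start of the line or by a character that is
# not alphanumeric, underscore or dot, and followed by such a character or by
# the end of the line.
result=$(printf '%s\n' "$input" |
  grep -E '(^|[^[:alnum:]_.])test([^[:alnum:]_.]|$)')

printf '%s\n' "$result"   # prints: asaf foo-test asfdads
```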
That is, test preceded by either a non-word-constituent or the beginning of the line, and followed by either a non-word-constituent or the end of the line.
(Above, [:alnum:] matches digits and letters in your locale, not only ASCII ones; fix the locale to C if you want only ASCII ones.)
If you don't want those surrounding non-word-constituents to be included in the match (for instance because you're using GNU's -o), you can this time use PCRE regexps and look-around operators:
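A sketch of what this could look like, assuming a GNU grep built with PCRE support (-P). It uses a negative look-behind and a negative look-ahead, so that -o reports only test itself. The variable names are illustrative:

```shell
# Sample input from the question.
input='asf .test. afd
asaf foo-test asfdads'

# (?<![\w.]) : not preceded by a word character or a dot
# (?![\w.])  : not followed by a word character or a dot
# (*UCP)     : ask PCRE to use Unicode properties for \w
result=$(printf '%s\n' "$input" |
  grep -Po '(*UCP)(?<![\w.])test(?![\w.])')

printf '%s\n' "$result"   # prints: test
```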
Remove (*UCP) and add LC_ALL=C to match only ASCII letters and digits.
Using (*UCP) at the start of a regexp tells the PCRE library that U̲niC̲ode P̲roperties have to be used for \w.
Without it, \w would match your locale's alphanumericals and underscore but only for single-byte characters. That wouldn't work in UTF-8 locales (the norm nowadays) where only ASCII ones would be matched.
(*UCP) makes it work for UTF-8 as well. It would match based on PCRE's own notion of character properties which might be different from your locale's, but on GNU systems, that's just as well as the UTF-8 locale definitions there are incomplete and outdated (at least as of 2015-04).
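For the ASCII-only variant mentioned above, a sketch under the same assumption (GNU grep with -P support), with (*UCP) dropped and the locale forced to C for that one command only:

```shell
# Sample input from the question.
input='asf .test. afd
asaf foo-test asfdads'

# LC_ALL=C applies to grep only; \w is then limited to ASCII letters,
# digits and underscore.
result=$(printf '%s\n' "$input" |
  LC_ALL=C grep -Po '(?<![\w.])test(?![\w.])')

printf '%s\n' "$result"   # prints: test
```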