Grep – How to Define Word Characters for grep -w

grep, regular expression, search

$ printf 'asf .test. afd\nasaf foo-test asfdads\n'
asf .test. afd
asaf foo-test asfdads

$ printf 'asf .test. afd\nasaf foo-test asfdads\n' | grep -w test
asf .test. afd
asaf foo-test asfdads

Question: How can I match only the "foo-test" line? More precisely, how can I tell "-w" to treat "-" as a separator, but not "."?

Or in other words, can I tell grep that . is among the characters that make up words, and thus that there's no word boundary in between . and test?

Or are there other solutions than grep?

Best Answer

In versions prior to 2.19, GNU grep's -w only considered single-byte alphanumeric characters and underscore as word constituents (so in UTF-8 locales, only the 26+26+10+1 ASCII letters, digits and underscore). So, for instance, echo Stéphane | grep -w St would match. That was fixed in 2.19.
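The byte-wise view described above can still be observed by forcing the C locale. This is only a sketch of the effect, not the pre-2.19 code path itself; the variable name c_locale_result is made up for the illustration.

```shell
# In the C locale every byte is treated as one character, and the bytes
# that encode "é" are not alphanumeric there, so "St" is followed by a
# non-word byte and -w accepts it as a whole word.
if printf 'Stéphane\n' | LC_ALL=C grep -qw St; then
    c_locale_result='match'
else
    c_locale_result='no match'
fi

printf 'C locale: %s\n' "$c_locale_result"
```

In a UTF-8 locale with grep 2.19 or later, the same command should no longer match, since é is then seen as a letter.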

You could however implement the logic by hand:

 grep -E '([^[:alnum:]_.]|^)test([^[:alnum:]_.]|$)'

That is, test preceded by either a non-word-constituent or the beginning of the line, and followed by either a non-word-constituent or the end of the line. Here the word constituents are alphanumeric characters, underscore and dot.

(Above, [:alnum:] matches the digits and letters of your locale, not only ASCII ones; set the locale to C if you want only ASCII ones.)
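As a quick check, the hand-written expression can be run against the sample input from the question. This is a minimal sketch; the variable names input and result are made up for the illustration, and LC_ALL=C is used so that [:alnum:] covers ASCII letters and digits only.

```shell
# Sample input from the question: "." should count as a word character,
# "-" should not.
input=$(printf 'asf .test. afd\nasaf foo-test asfdads\n')

# "test" must be preceded by a non-word-constituent or the start of the
# line, and followed by a non-word-constituent or the end of the line.
result=$(printf '%s\n' "$input" |
    LC_ALL=C grep -E '([^[:alnum:]_.]|^)test([^[:alnum:]_.]|$)')

printf '%s\n' "$result"
```

Only the foo-test line is printed: in .test. the dots now count as part of the word, so there is no boundary on either side of test.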

If you don't want those surrounding non-word-constituents to be included in the match (for instance because you're using GNU grep's -o), you can instead use PCRE regexps and look-around operators:

grep -Po '(*UCP)(?<![\w.])test(?![\w.])'

Remove (*UCP) and add LC_ALL=C to match only ASCII letters and digits.

Using (*UCP) at the start of a regexp tells the PCRE library that U̲niC̲ode P̲roperties have to be used for \w.

Without it, \w would match your locale's alphanumeric characters and underscore, but only single-byte ones. That wouldn't work in UTF-8 locales (the norm nowadays), where only the ASCII ones would be matched. (*UCP) makes it work for UTF-8 as well. Matching is then based on PCRE's own notion of character properties, which might differ from your locale's, but on GNU systems that's just as well, as the UTF-8 locale definitions there are incomplete and outdated (at least as of 2015-04).
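To see what the look-around operators change when -o is used, the two expressions can be compared on one line of the sample input. This is a sketch that assumes a GNU grep built with PCRE support (for -P); it uses the ASCII-only variant (LC_ALL=C, no (*UCP)), and the variable names are made up for the illustration.

```shell
# Line from the question where "test" follows "-", which is a separator.
line='asaf foo-test asfdads'

# ERE version: the separators on both sides are consumed by the match.
ere_match=$(printf '%s\n' "$line" |
    LC_ALL=C grep -Eo '([^[:alnum:]_.]|^)test([^[:alnum:]_.]|$)')

# PCRE version: look-arounds check the context without consuming it.
pcre_match=$(printf '%s\n' "$line" |
    LC_ALL=C grep -Po '(?<![\w.])test(?![\w.])')

printf 'ERE:  [%s]\n' "$ere_match"
printf 'PCRE: [%s]\n' "$pcre_match"
```

The ERE match comes out as [-test ], including the dash and the trailing space, while the PCRE match is just [test].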
