Grep – How to Define Word Characters for grep -w

Tags: grep, regular expression, search

$ printf 'asf .test. afd\nasaf foo-test asfdads\n'
asf .test. afd
asaf foo-test asfdads

$ printf 'asf .test. afd\nasaf foo-test asfdads\n' | grep -w test
asf .test. afd
asaf foo-test asfdads

Question: How can I match only the "foo-test" line? More precisely, how can I tell "-w" to treat "-" as a word separator, but not "."?

Or in other words, can I tell grep that . is among the characters that make up words, and thus that there's no word boundary in between . and test?

Or are there other solutions than grep?

Best Answer

In versions prior to 2.19, GNU grep's -w would only consider single-byte alphanumeric characters and underscore as word constituents. In UTF-8 locales, that means only the 26+26+10+1 ASCII letters, digits and underscore, so for instance echo Stéphane | grep -w St would match. That was fixed in 2.19.

You could however implement the logic by hand:

 grep -E '([^[:alnum:]_.]|^)test([^[:alnum:]_.]|$)'

That is test preceded by either a non-word-constituent or the beginning of the line and followed by either a non-word-constituent or the end of the line.

(Above, [:alnum:] matches the digits and letters of your locale, not only the ASCII ones. Set the locale to C if you want only the ASCII ones.)
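As a quick check of the hand-written boundaries, here is a minimal sketch that feeds the question's two sample lines to that pattern. The sample helper and the matched variable are names made up for this illustration, and LC_ALL=C is set only to restrict [:alnum:] to ASCII.

```shell
# Hypothetical helper that reproduces the two sample lines from the question.
sample() { printf 'asf .test. afd\nasaf foo-test asfdads\n'; }

# "." is listed as a word constituent, "-" is not, so only the
# "foo-test" line has "test" standing as a whole word.
matched=$(sample | LC_ALL=C grep -E '([^[:alnum:]_.]|^)test([^[:alnum:]_.]|$)')
printf '%s\n' "$matched"
# prints: asaf foo-test asfdads
```

The first line is rejected because "test" is preceded and followed by ".", which the bracket expressions exclude from the set of separators.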

If you don't want those surrounding non-word-constituents to be included in the match (for instance because you're using GNU's -o), you can instead use PCRE regexps (GNU grep's -P) and look-around operators:

grep -Po '(*UCP)(?<![\w.])test(?![\w.])'

Remove (*UCP) and add LC_ALL=C to match only ASCII letters and digits.

Using (*UCP) at the start of a regexp tells the PCRE library that U̲niC̲ode P̲roperties have to be used for \w.

Without it, \w would match your locale's alphanumericals and underscore, but only for single-byte characters. That wouldn't work in UTF-8 locales (the norm nowadays), where only ASCII ones would be matched. (*UCP) makes it work for UTF-8 as well. It then matches based on PCRE's own notion of character properties, which might differ from your locale's, but on GNU systems that's just as well, as the UTF-8 locale definitions there are incomplete and outdated (at least as of 2015-04).
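To see that the look-around version reports only the word itself, here is a small sketch on the question's sample input. It assumes a GNU grep built with PCRE support, and it uses the ASCII-only variant (no (*UCP), LC_ALL=C) so the result doesn't depend on the locale. The sample helper and the only variable are illustrative names, and -n is added just to show which line matched.

```shell
# Hypothetical helper that reproduces the two sample lines from the question.
sample() { printf 'asf .test. afd\nasaf foo-test asfdads\n'; }

# -o prints only the matched text, -n prefixes it with the line number.
# The look-arounds check the neighbours of "test" without consuming them.
only=$(sample | LC_ALL=C grep -Pon '(?<![\w.])test(?![\w.])')
printf '%s\n' "$only"
# prints: 2:test
```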

Related Solutions

Text Processing – Can Grep Output Only Specified Groupings That Match

GNU grep has the -P option for Perl-style regexes, and the -o option to print only what matches the pattern. These can be combined with look-around assertions (described under Extended Patterns in the perlre manpage) to exclude part of the pattern from what -o reports as the match.

$ grep -oP 'foobar \K\w+' test.txt
bash
happy
$

\K resets the start of the reported match, so it acts as a shorter (and more efficient) alternative to (?<=pattern), the zero-width look-behind assertion you would otherwise put before the text you want to output. (?=pattern) can be used as a zero-width look-ahead assertion after the text you want to output.

For instance, if you wanted to match the word between foo and bar, you could use:

$ grep -oP 'foo \K\w+(?= bar)' test.txt

or (for symmetry)

$ grep -oP '(?<=foo )\w+(?= bar)' test.txt
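The contents of test.txt are not shown, so the following sketch uses a made-up stand-in file to illustrate how the three patterns behave. It assumes a GNU grep built with PCRE support; the file contents and variable names are hypothetical.

```shell
# Hypothetical stand-in for test.txt (the original file is not shown).
tmp=$(mktemp)
printf 'foobar bash 1\nbar\nfoobar happy\nfoo middle bar\n' > "$tmp"

# Print only the word that follows "foobar ".
after_foobar=$(grep -oP 'foobar \K\w+' "$tmp")

# Print only the word between "foo " and " bar", in both spellings.
between_k=$(grep -oP 'foo \K\w+(?= bar)' "$tmp")
between_lb=$(grep -oP '(?<=foo )\w+(?= bar)' "$tmp")

printf '%s\n' "$after_foobar" "$between_k" "$between_lb"
rm -f "$tmp"
```

With this sample file, the first command outputs bash and happy, and the last two both output middle.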

Linux – Using Grep to Find Multiple Repeating Characters in a Word

This calls for backreferences!

If you are ever referring to something you have already matched, and you want to match it again, use backreferences.

grep -E '(..)(.*\1){<n - 1>}' <file>

.* matches any sequence of characters
(..) matches any two characters
\1 matches the first group, in this case the (..) near the beginning

Replace <n - 1> with the number of occurrences you are looking for minus one, and <file> with the name of the file you want to search (or omit it to read from stdin).

This may not be the most efficient solution, but it works.
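As an illustration of the backreference approach, here is a sketch on a made-up word list. It assumes GNU grep, which accepts backreferences in -E patterns as an extension; the words helper and the variable names are hypothetical.

```shell
# Hypothetical word list for the demonstration.
words() { printf 'banana\napple\nmississippi\nabcabcabc\n'; }

# n = 2: some two-character pair occurs at least twice ({1} = n - 1).
twice=$(words | grep -E '(..)(.*\1){1}')

# n = 3: some two-character pair occurs at least three times ({2} = n - 1).
thrice=$(words | grep -E '(..)(.*\1){2}')

printf 'twice:\n%s\nthrice:\n%s\n' "$twice" "$thrice"
```

Here banana ("an"), mississippi ("ss") and abcabcabc ("ab") contain a pair twice, while only abcabcabc contains one three times. The occurrences cannot overlap, since each \1 has to start after the previous match ends.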
