Text Processing – Can Grep Output Only Specified Groupings That Match

grep, regular expression, text processing

Say I have a file:

# file: 'test.txt'
foobar bash 1
bash
foobar happy
foobar

I only want to know what words appear after "foobar", so I can use this regex:

"foobar \(\w\+\)"

The parentheses indicate that I have a special interest in the word right after foobar. But when I run grep "foobar \(\w\+\)" test.txt, I get the entire lines that match the regex, rather than just the word after foobar:

foobar bash 1
foobar happy

I would much prefer that the output of that command looked like this:

bash
happy

Is there a way to tell grep to only output the items that match the grouping (or a specific grouping) in a regular expression?

Best Answer

GNU grep has the -P option for Perl-style regexes, and the -o option to print only what matches the pattern. These can be combined with look-around assertions (described under Extended Patterns in the perlre manpage) to exclude part of the pattern from what -o treats as the match.

$ grep -oP 'foobar \K\w+' test.txt
bash
happy
$

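\K works as a shorter (and more efficient) alternative to a (?<=pattern) zero-width look-behind assertion: everything matched before \K is dropped from the reported match, so you place it just before the text you want to output. Unlike a look-behind, the part before \K may have variable length. (?=pattern) can be used as a zero-width look-ahead assertion after the text you want to output.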

For instance, if you wanted to match the word between foo and bar, you could use:

$ grep -oP 'foo \K\w+(?= bar)' test.txt

or (for symmetry)

$ grep -oP '(?<=foo )\w+(?= bar)' test.txt
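The test.txt from the question has no line with a word between foo and bar, so here is a minimal sketch that runs both forms against a hypothetical sample file (the file contents are an assumption made purely for illustration). It assumes GNU grep built with PCRE support.

```shell
# Hypothetical sample input; not part of the original question.
sample=$(mktemp)
printf '%s\n' 'foo middle bar' 'foo other baz' 'bar foo' > "$sample"

# \K drops "foo " from the reported match; (?= bar) requires " bar"
# to follow without including it in the output.
with_k=$(grep -oP 'foo \K\w+(?= bar)' "$sample")

# The same match written with a look-behind instead of \K.
with_lookbehind=$(grep -oP '(?<=foo )\w+(?= bar)' "$sample")

printf '%s\n' "$with_k" "$with_lookbehind"
rm -f "$sample"
```

Both commands should print only the word from the first line, since that is the only line where foo is followed by a word and then bar.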

Related Solutions

Bash event not found trying to match and exclude parenthesis in grep

The rules are different for single quotes versus double quotes.

For the reason you show, double quotes can't be used reliably in bash, because there's no sane way to escape an exclamation mark.

$ grep -oP "\\(.*(?!word).*right"
bash: !word: event not found

$ grep -oP "\\(.*(?\!word).*right"
grep: unrecognized character after (? or (?-

The second error occurs because bash passes \! rather than ! through to grep, as this shows:

$ printf '%s' "\!"
\!

When you tried single quotes, the double backslash didn't mean an escaped backslash; it meant two literal backslashes.

$ printf '%s' '\\(.*(?!word).*right'
\\(.*(?!word).*right

Inside single quotes, everything is literal, and there are no escapes, so the way to write the regular expression you're trying is:

$ grep -oP '\(.*(?!word).*right'
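To check what grep actually receives, a short sketch like the following may help. The input line is a hypothetical sample chosen only to exercise the pattern, and it assumes GNU grep with PCRE support. History expansion is normally off in scripts, so the "event not found" error shows up only at an interactive prompt.

```shell
# Single quotes pass the pattern through character for character.
pattern='\(.*(?!word).*right'
printf '%s\n' "$pattern"

# Hypothetical input line, not taken from the original question.
matched=$(printf '%s\n' 'a (b right c' | grep -oP "$pattern")
printf '%s\n' "$matched"
```

The first printf shows the pattern with a single backslash before the parenthesis, which is what grep needs in order to match a literal "(".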

Why does grep output lines that seemingly don’t match the expression

This looks like your locale collation rules being very ... helpful.

Try it with

LC_ALL=C grep '[A-Z]' input.txt

to test that idea.

I have

export LANG=en_US.UTF-8
export LC_COLLATE=C
export LC_NUMERIC=C

in my shell startup to avoid this kind of trouble while still getting my Unicode goodness.
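As a quick sketch of the effect, the following feeds two hypothetical lines to grep with the C locale forced for that one command. The bracket expression is quoted so the shell cannot expand it as a filename glob.

```shell
# Two hypothetical input lines: one lowercase, one uppercase.
# In the C locale, [A-Z] covers only the ASCII uppercase letters.
c_locale_matches=$(printf '%s\n' 'abc' 'ABC' | LC_ALL=C grep '[A-Z]')
printf '%s\n' "$c_locale_matches"
```

Only the uppercase line should be printed. Running the same pipeline without LC_ALL=C shows whether the current locale's collation rules are what let lowercase lines through.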
