PCRE-regex Use grep to exclude a capturing group

grep, pcre, regular expression, text processing

I am using GNU grep with -P (PCRE support) to match strings from a file. The input file has lines containing strings like:

FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343

I want to capture the numbers 2 and 0.45654343 from the above line. I used the regex

grep -Po ".zoo.\K[\d+](.*):\ (.*)$" file

But this gives me the result

2.someString:More-RandomString (string here too): 0.45654343

I am able to get the first number from the first capturing group as 2, and also to match a capturing group at the end of the line. But I am not able to skip the words/lines between two capturing groups.

I know for a fact that I have a group (.*) that is capturing those words in the middle. What I've tried to do is include another \K to ignore it as

grep -Po ".zoo.\K[\d+](.*):\K (.*)$" file

But that gave me only the second capture group, 0.45654343.

I also tried a non-capturing group with the (?:) syntax:

grep -Po ".zoo.\K[\d+](?=.someString:More-RandomString (string here too)):\ (.*)$"

But this gave me nothing. What am I missing here?

Best Answer

grep's name comes from the g/re/p command of ed. Its primary purpose is to print the lines that match a regexp. Editing the content of those lines is not its role; you have sed (the stream editor) or awk for that.

Now, some grep implementations, starting with GNU grep, added a -o option to print the matched portion of each line (what is matched by the regexp, not its capture groups). And some implementations, like GNU's again (with -P) or pcregrep, support PCREs for their regexps.

pcregrep actually added a -o<n> option to print the content of a capture group. So you could do:

pcregrep -o1 -o2 --om-separator=' ' '\.zoo\.(\d+).*:\s+(.*)'

But here, the obvious standard solution is to use sed:

sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p'

Or if you want perl regexps, use perl:

perl -lne 'print "$1 $2" if /\.zoo\.(\d+).*:\s+(.*)/'

With GNU grep, if you don't mind the matches to appear on different lines, you can do:

$ grep -Po '\.zoo\.\K\d+|:\s+\K.*' < file
2
0.45654343

Note that while \K resets the start of the matched portion, that doesn't mean you can get away with the two parts of the alternation overlapping.

grep -Po '\.zoo\.(\K\d+|.*: \K.*)'

would not work, just like echo foobar | grep -Po 'foo|foob' does not print both foo and foob. foo|foob first matches foo; grep then looks for further matches in the input after that foo, that is, starting at the b of bar, and cannot find any more there.
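The foobar case is easy to try by hand. A minimal sketch, assuming a GNU grep built with -P support (the variable names are just for the example):

```shell
# PCRE alternation is ordered: the first alternative that matches wins,
# and the next search starts after the end of that match.
first=$(echo foobar | grep -Po 'foo|foob')    # foo matches, search resumes at "bar"
second=$(echo foobar | grep -Po 'foob|foo')   # foob matches, search resumes at "ar"

printf '%s\n' "$first" "$second"
```

Each command prints a single match (foo for the first, foob for the second); neither order yields both overlapping strings.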

Above with grep -Po '\.zoo\.\K\d+|:\s+\K.*', we only look for :<spaces><anything> in the second part of the alternation. That does match in the part that is after .zoo.<digits> but that also means it would find those :<spaces><anything> anywhere in the input, not only when they follow .zoo.<digits>.
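To see that caveat in practice, here is a small sketch with a made-up input line that has no .zoo.<digits> at all. It assumes GNU grep with -P support.

```shell
re='\.zoo\.\K\d+|:\s+\K.*'

# Hypothetical line without any ".zoo.<digits>" part: the second
# alternative still matches after the ":<space>".
stray=$(printf '%s\n' 'some other line: not wanted' | grep -Po "$re")

printf '%s\n' "$stray"
```

That prints not wanted, even though the line never contained .zoo.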

There is a way to work around that though, using another PCRE special operator: \G. \G matches at the start of the subject. For a single match, that's equivalent to ^, but with multiple matches (think of sed/perl's g flag in s/.../.../g) like with -o where grep tries to find all the matches in the line, that also matches after the end of the previous match. So if you make it:

grep -Po '\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

where (?!^) is a negative look-ahead operator that means "not at the beginning of the line", that \G will only match after a previous successful (non-empty) match. So .*:\s+\K.* will only match if it follows a previous successful match, and that can only be the .zoo.<digits> one, since the other part of the alternation matches until the end of the line.
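A short sketch of the \G variant, run on the sample line and on a made-up line without .zoo. (again assuming GNU grep with -P; the || true is only there so the snippet does not abort under set -e when grep finds nothing):

```shell
re='\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

line='FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343'

# First match: the digits after .zoo.; second match: anchored by \G
# right after the first one, so it can reach the value after ":<space>".
matched=$(printf '%s\n' "$line" | grep -Po "$re" || true)

# Hypothetical line with no .zoo.<digits>: \G never gets a previous match
# to attach to, and (?!^) rules out the start of the line.
unrelated=$(printf '%s\n' 'some other line: not wanted' | grep -Po "$re" || true)

printf 'matched:\n%s\n' "$matched"
printf 'unrelated: [%s]\n' "$unrelated"
```

The first input should give 2 and 0.45654343 on separate lines, and the second should give nothing.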

On an input like:

.zoo.1.zoo.2 tar: blah

That would output:

1
2
blah

That may not be what you want. In that case, the first part of the alternation should also match only at the beginning of the line. Something like:

grep -Po '^.*?\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

That still outputs 2 on an input like .zoo.2 no colon character or .zoo.2 blah:. You could work around that with a look-ahead operator in the first part of the alternation, requiring at least one non-space after :<spaces> in both parts (and also using $ to avoid issues with non-characters):

grep -Po '^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'
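Here is a rough check of that last regexp against the inputs mentioned above. The extract helper is a hypothetical name introduced for this sketch, and GNU grep with -P is assumed.

```shell
re='^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'

# Hypothetical helper: run the pattern on a single input line.
# "|| true" keeps a non-matching grep from aborting under set -e.
extract() { printf '%s\n' "$1" | grep -Po "$re" || true; }

full=$(extract 'FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343')
no_colon=$(extract '.zoo.2 no colon character')
empty_value=$(extract '.zoo.2 blah:')

printf 'full:\n%s\n' "$full"
printf 'no colon: [%s]\n' "$no_colon"
printf 'empty value: [%s]\n' "$empty_value"
```

Only the first input should produce output (2 and 0.45654343 on separate lines); the other two fail the look-ahead, so nothing is printed for them.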

You'd probably need a few pages of comments to explain that regexp, so I would still go for the straightforward sed/perl solutions...
