I am using GNU grep with the -P PCRE regex support for matching strings from a file. The input file has lines containing strings like:
FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343
I want to capture the numbers 2 and 0.45654343 from the above line. I used the regex
grep -Po ".zoo.\K[\d+](.*):\ (.*)$" file
But this gives me the result
2.someString:More-RandomString (string here too): 0.45654343
I am able to get the first number from the first capturing group as 2, and also to match a capturing group at the end of the line. But I am not able to skip the words/lines between the two capturing groups.
I know for a fact that I have a group (.*) that is capturing those words in the middle. What I've tried to do is include another \K to ignore it as
grep -Po ".zoo.\K[\d+](.*):\K (.*)$" file
But that gave me only the second capture group as 0.556984.
I also tried a non-capturing group with the (?:) syntax as
grep -Po ".zoo.\K[\d+](?=.someString:More-RandomString (string here too)):\ (.*)$"
But this gave me nothing. What am I missing here?
Best Answer
grep's name comes from the g/re/p command of ed. Its primary purpose is to print the lines that match a regexp. It's not its role to edit the content of those lines. You have sed (the stream editor) or awk for that.

Now, some grep implementations, starting with GNU grep, added a -o option to print the matched portion of each line (what is matched by the regexp, not its capture groups). You've also got some grep implementations, like GNU's again (with -P) or pcregrep, that support PCREs for their regexps. pcregrep actually added a -o<n> option to print the content of a capture group, so you could use that to print both numbers.

But here, the obvious standard solution is to use sed. Or, if you want perl regexps, use perl.
With GNU grep, if you don't mind the matches appearing on different lines, you can use an alternation such as grep -Po '\.zoo\.\K\d+|:\s+\K.*'.

Note that while \K resets the start of the matched portion, that doesn't mean you can get away with the two parts of the alternation overlapping. Repeating the .zoo.<digits> part at the start of the second alternative (so that it overlaps with the first) would not work, just like
echo foobar | grep -Po 'foo|foob'
wouldn't work (at printing both foo and foob). foo|foob first matches foo, and then grep looks for potential other matches in the input after the foo, so starting at the b of bar, and can't find any more after that.

Above, with grep -Po '\.zoo\.\K\d+|:\s+\K.*', we only look for :<spaces><anything> in the second part of the alternation. That does match in the part that is after .zoo.<digits>, but that also means it would find those :<spaces><anything> anywhere in the input, not only when they follow .zoo.<digits>.

There is a way to work around that though, using another PCRE special operator:
\G.

\G matches at the start of the subject. For a single match, that's equivalent to ^, but with multiple matches (think of sed/perl's g flag in s/.../.../g), like with -o where grep tries to find all the matches in the line, it also matches after the end of the previous match. So if you make the second part of the alternation (?!^)\G.*:\s+\K.*, where (?!^) is a negative look-ahead operator that means not at the beginning of the line, that \G will only match after a previous successful (non-empty) match. So .*:\s+\K.* will only match if it follows a previous successful match, and that can only be the .zoo.<digits> one, since the other part of the alternation matches till the end of the line.

The first part of the alternation can still match several times on a line that contains more than one .zoo.<digits>, though. If you did not want that, you'd also want the first part of the alternation to only match at the beginning of the line.
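To make the behaviour discussed so far concrete, here is a small sketch comparing the plain alternation with the \G variant. It assumes GNU grep built with -P support. The `stray` and `multi` lines are made-up inputs for illustration, not the examples from the original answer, and the \G command is reassembled from the pieces named in the text.

```shell
#!/bin/sh
# Sample line from the question, plus two hypothetical lines for illustration.
line='FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343'
stray='no zoo here: still printed'   # hypothetical: a colon but no .zoo.<digits>
multi='.zoo.1.zoo.2 foo: bar'        # hypothetical: two .zoo.<digits> on one line

plain='\.zoo\.\K\d+|:\s+\K.*'
with_g='\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

# Plain alternation: each match is printed on its own line.
plain_line=$(printf '%s\n' "$line" | grep -Po "$plain" || true)
printf '%s\n' "$plain_line"          # 2, then 0.45654343

# Overlapping alternatives: only the first alternative is ever reported.
overlap=$(echo foobar | grep -Po 'foo|foob' || true)
printf '%s\n' "$overlap"             # foo

# The plain alternation also fires on a line without .zoo.<digits>.
plain_stray=$(printf '%s\n' "$stray" | grep -Po "$plain" || true)
printf '%s\n' "$plain_stray"         # still printed

# With \G, the second alternative needs a previous match on the same line.
g_line=$(printf '%s\n' "$line" | grep -Po "$with_g" || true)
g_stray=$(printf '%s\n' "$stray" | grep -Po "$with_g" || true)
printf '%s\n' "$g_line"              # 2, then 0.45654343
printf '[%s]\n' "$g_stray"           # [] (no match)

# The first alternative can still match more than once per line.
g_multi=$(printf '%s\n' "$multi" | grep -Po "$with_g" || true)
printf '%s\n' "$g_multi"             # 1, then 2, then bar
```

The `|| true` only keeps the script going when grep finds nothing and returns a non-zero status.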
Anchoring the first part that way still outputs 2 on an input like .zoo.2 no colon character or .zoo.2 blah:. You could work around that with a look-ahead operator in the first part of the alternation, and by looking for at least one non-space after :<spaces> (and also using $ to avoid issues with non-characters).

You'd probably need a few pages of comments to explain the resulting regexp, so I would still go for the straightforward sed/perl solutions...
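For reference, the anchored and look-ahead variants described above could be sketched as follows. Both patterns are my own reconstruction from the description (anchoring with ^.*?, a look-ahead after the digits, \S after the colon and spaces, and a trailing $), so treat them as assumptions rather than the exact regexps of the original answer. The `multi` line is a hypothetical input. GNU grep with -P support is assumed.

```shell
#!/bin/sh
line='FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343'
multi='.zoo.1.zoo.2 foo: bar'          # hypothetical input
nocolon='.zoo.2 no colon character'    # input mentioned in the text
trailing='.zoo.2 blah:'                # input mentioned in the text

# First alternative anchored at the start of the line (assumed form).
anchored='^.*?\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'
# Same, plus a look-ahead requiring ":<spaces><non-space>" later on the line.
strict='^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'

# Anchoring stops the second .zoo.<digits> from being reported separately.
anchored_multi=$(printf '%s\n' "$multi" | grep -Po "$anchored" || true)
printf '%s\n' "$anchored_multi"        # 1, then bar

# ...but the number is still printed when nothing follows a colon.
anchored_nocolon=$(printf '%s\n' "$nocolon" | grep -Po "$anchored" || true)
printf '%s\n' "$anchored_nocolon"      # 2

# The look-ahead variant prints nothing for those incomplete lines.
strict_nocolon=$(printf '%s\n' "$nocolon" | grep -Po "$strict" || true)
strict_trailing=$(printf '%s\n' "$trailing" | grep -Po "$strict" || true)
printf '[%s][%s]\n' "$strict_nocolon" "$strict_trailing"   # [][]

# The full sample line still yields both numbers.
strict_line=$(printf '%s\n' "$line" | grep -Po "$strict" || true)
printf '%s\n' "$strict_line"           # 2, then 0.45654343
```

Compared with the one-line sed or perl commands, this is noticeably harder to read, which is the point made above.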