-e is strictly the flag for indicating the pattern you want to match against. -E controls whether you need to escape certain special characters. man grep explains -E a bit more:
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead, so portable scripts should avoid { in grep -E patterns and should use [{] to match a
literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start of an invalid interval specification. For example, the command grep -E '{1' searches for
the two-character string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an extension, but portable scripts should avoid it.
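As a rough illustration of the manual excerpt above, the sketch below runs the same interval expression in both modes. The three sample lines and the variable names are made up for illustration only.

```shell
# Hypothetical sample input: three lines, one containing a literal "{2}".
sample=$(printf '%s\n' 'ab' 'aab' 'a{2}b')

# BRE (the default): the interval has to be written with backslashes.
bre_interval=$(printf '%s\n' "$sample" | grep -e 'a\{2\}b')

# ERE (-E): the same interval is written without backslashes.
ere_interval=$(printf '%s\n' "$sample" | grep -E -e 'a{2}b')

# BRE again: unescaped braces are ordinary characters here.
bre_literal=$(printf '%s\n' "$sample" | grep -e 'a{2}b')

printf 'BRE interval: %s\nERE interval: %s\nBRE literal:  %s\n' \
  "$bre_interval" "$ere_interval" "$bre_literal"
```

In all three commands -e only introduces the pattern; it is -E (or its absence) that decides how the braces are read.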
grep's name comes from the g/re/p ed command. Its primary purpose is to print the lines that match a regexp. It's not its role to edit the content of those lines. You have sed (the stream editor) or awk for that.
Now, some grep implementations, starting with GNU grep, added a -o option to print the matched portion of each line (what is matched by the regexp, not its capture groups). Some grep implementations, like GNU's again (with -P) or pcregrep, support PCREs for their regexps. pcregrep actually added a -o<n> option to print the content of a capture group. So you could do:
pcregrep -o1 -o2 --om-separator=' ' '\.zoo\.(\d+).*:\s+(.*)'
But here, the obvious standard solution is to use sed:
sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p'
Or if you want perl regexps, use perl:
perl -lne 'print "$1 $2" if /\.zoo\.(\d+).*:\s+(.*)/'
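The input file is not shown in the text, so the following is only a sketch that runs the sed command above on a hypothetical line shaped like the expected output suggests.

```shell
# Hypothetical input line; the real file is not shown in the text.
line='abc.zoo.2.xyz: 0.45654343'

# Drop everything up to the digits after ".zoo.", keep the digits,
# then drop everything up to and including ":<spaces>".
result=$(printf '%s\n' "$line" |
  sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p')

printf '%s\n' "$result"
```

On that sample line the two fields end up on one line, separated by a space.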
With GNU grep, if you don't mind the matches appearing on different lines, you can do:
$ grep -Po '\.zoo\.\K\d+|:\s+\K.*' < file
2
0.45654343
Note that while \K resets the start of the matched portion, that doesn't mean you can get away with the two parts of the alternation overlapping.
grep -Po '\.zoo\.(\K\d+|.*: \K.*)'
would not work, just like echo foobar | grep -Po 'foo|foob' wouldn't work (at printing both foo and foob). foo|foob first matches foo, and then grep looks for potential other matches in the input after the foo, that is, starting at the b of bar, so it can't find any more after that.
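The foobar case can be tried directly. This sketch assumes a GNU grep built with PCRE support (-P); the variable names are only for illustration.

```shell
# Assumes GNU grep with PCRE support (-P).
# The first alternative matches at offset 0; the search then resumes at "bar".
first=$(echo foobar | grep -Po 'foo|foob')

# Swapping the alternatives changes which one is reported,
# but there is still only a single match.
second=$(echo foobar | grep -Po 'foob|foo')

printf '%s\n%s\n' "$first" "$second"
```

Either way only one of the overlapping candidates is printed, never both.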
Above, with grep -Po '\.zoo\.\K\d+|:\s+\K.*', we only look for :<spaces><anything> in the second part of the alternation. That does match in the part that is after .zoo.<digits>, but that also means it would find those :<spaces><anything> anywhere in the input, not only when they follow .zoo.<digits>.
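A small sketch of that caveat, assuming GNU grep with -P and two made-up input lines:

```shell
# Assumes GNU grep with -P; both input lines are hypothetical.
pattern='\.zoo\.\K\d+|:\s+\K.*'

# A line with .zoo.<digits> followed later by ":<spaces><anything>".
wanted=$(printf '%s\n' 'abc.zoo.2.xyz: 0.45654343' | grep -Po "$pattern")

# A line with no .zoo. at all still matches through the second branch.
stray=$(printf '%s\n' 'tar: blah' | grep -Po "$pattern")

printf 'wanted:\n%s\nstray:\n%s\n' "$wanted" "$stray"
```

The second command prints blah even though the line has nothing to do with .zoo., which is the unwanted match described above.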
There is a way to work around that though, using another PCRE special operator: \G. \G matches at the start of the subject. For a single match, that's equivalent to ^, but with multiple matches (think of sed/perl's g flag in s/.../.../g), like with -o where grep tries to find all the matches in the line, it also matches after the end of the previous match. So if you make it:
grep -Po '\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'
where (?!^) is a negative look-ahead operator that means not at the beginning of the line, then \G will only match after a previous successful (non-empty) match, so .*:\s+\K.* will only match if it follows a previous successful match, and that can only be the .zoo.<digits> one since the other part of the alternation matches until the end of the line.
On an input like:
.zoo.1.zoo.2 tar: blah
That would output:
1
2
blah
If you did not want that, you'd also want the first part of the alternation to only match at the beginning of the line. Something like:
grep -Po '^.*?\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'
That still outputs 2 on an input like .zoo.2 no colon character or .zoo.2 blah:. You could work around that with a look-ahead operator in the first part of the alternation, and by looking for at least one non-space after :<spaces> (and also using $ to avoid issues with non-characters):
grep -Po '^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'
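As a sketch only, the pattern above can be exercised on three hypothetical lines, assuming GNU grep with -P: one complete line and the two problem inputs just mentioned.

```shell
# Assumes GNU grep with -P; the three input lines are hypothetical.
pattern='^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'

# "|| true" keeps grep's no-match exit status from aborting under "set -e".
both=$(printf '%s\n' 'abc.zoo.2.xyz: 0.45654343' | grep -Po "$pattern" || true)
no_colon=$(printf '%s\n' '.zoo.2 no colon character' | grep -Po "$pattern" || true)
empty_tail=$(printf '%s\n' '.zoo.2 blah:' | grep -Po "$pattern" || true)

printf 'both:\n%s\nno_colon: [%s]\nempty_tail: [%s]\n' \
  "$both" "$no_colon" "$empty_tail"
```

Only the complete line should produce output; the look-ahead in the first branch rejects the other two before the digits are printed.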
You'd probably need a few pages of comments to explain that regexp, so I would still go for the straightforward sed/perl solutions...
Best Answer
The reason it isn't matching is because you are looking for whitespace (\s) before the string tty and at the end of your match. That never happens here since ls will print one entry per line. Note that ls is not the same as ls | command. When the output of ls is piped, that activates the -1 option, causing ls to only print one entry per line. It will work as expected if you just remove those \s:
However, that will also match all sorts of things you don't want. That regex isn't what you need at all. The [ ] make a character class. The expression [tty]+ is equivalent to [ty]+ and will match one or more t or y. This means it will match t, or tttttttttttttttt, or tytytytytytytytytyt, or any other combination of one or both of those letters. Also, the parentheses are pointless here, they make a capture group but you're not using it. What you want is this:
Note how I added the $ there. That's so the expression only matches tty and then one number, one of 1, 2, 3 or 4, until the end of the line ($).
Of course, the safe way of doing this that avoids all of the dangers of parsing ls is to use globs instead:
or just
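To make the character-class point above concrete, here is a minimal sketch on hypothetical entry names rather than real ls output. The stricter pattern is an assumption reconstructed from the description (tty, one digit from 1 to 4, end of line), not a quotation of the original command.

```shell
# Hypothetical entry names standing in for a directory listing.
names=$(printf '%s\n' tty1 tty4 tty12 ttyS0 yt pty3)

# [tty]+ is a character class: any line containing a "t" or a "y" matches.
loose_count=$(printf '%s\n' "$names" | grep -cE '[tty]+')

# Literal "tty", one digit from 1 to 4, then end of line
# (pattern assumed from the description, not quoted from the text).
strict=$(printf '%s\n' "$names" | grep -E 'tty[1-4]$')

printf 'loose matches: %s\nstrict matches:\n%s\n' "$loose_count" "$strict"
```

The character class matches every sample name, including yt and pty3, while the anchored literal keeps only tty1 and tty4.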