GREP and REGEX – Troubleshooting Common Issues

arch linux, grep, regular expression

I'm trying to get the output from ls /dev to match 'tty' that ends with numbers between 1-4.

So from:

tty5
tty4
tty2
tty6
tty1

Should match:

tty4
tty2
tty1

The regexp

"\s([tty]+[0-4])\s"

works in RegExr.

I've tried using this with grep:

ls /dev | grep -E \s([tty]+[0-4])\s

ls /dev | grep -E \s([tty]\+\[0-4])\s

ls /dev | grep -Ex \s([tty]+[0-4])\s

ls /dev | grep -P \s([tty]+[0-4])\s

as I've read in other posts, still I can't make it work.

Best Answer

The reason it isn't matching is that you are looking for whitespace (\s) before the string tty and at the end of your match. That never happens here, since ls prints one entry per line when its output is piped. Note that ls on its own is not the same as ls | command: when the output of ls goes to a pipe, ls behaves as if the -1 option had been given and prints only one entry per line. The match works as expected if you remove those \s and quote the pattern so the shell passes it to grep unchanged:

ls /dev | grep -E '([tty]+[0-4])'

However, that will also match all sorts of things you don't want. That regex isn't what you need at all. The [ ] make a character class. The expression [tty]+ is equivalent to [ty]+ and will match one or more t or y. This means it will match t, or tttttttttttttttt, or tytytytytytytytytyt, or any other combination of one or both of those letters. Also, the parentheses are pointless here: they make a capture group, but you're not using it. What you want is this:

$ ls /dev | grep '^tty[0-4]$'
tty0
tty1
tty2
tty3
tty4

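Note how I added the ^ and $ there. That's so the expression only matches lines that consist of tty followed by exactly one digit, one of 0, 1, 2, 3 or 4, and nothing else up to the end of the line ($). Use [1-4] instead if you don't want tty0.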
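To see the difference between the character-class pattern and the anchored one, here is a small sketch that feeds a made-up list of names through both. The names are assumptions chosen for illustration (they are not real ls /dev output), and so are the variable names.

```shell
# Hypothetical device names standing in for the output of `ls /dev`.
names='tty0
tty1
tty4
tty5
tty10
ttyS0
y3'

# Character class: one or more "t" or "y", then a digit 0-4, anywhere in the line.
loose=$(printf '%s\n' "$names" | grep -E '[tty]+[0-4]')

# Literal "tty" followed by exactly one digit 0-4, anchored at both ends.
strict=$(printf '%s\n' "$names" | grep '^tty[0-4]$')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern also lets through tty10 and y3, while the anchored one keeps only tty0, tty1 and tty4.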

Of course, the safe way of doing this that avoids all of the dangers of parsing ls is to use globs instead:

$ ls /dev/tty[0-4]
/dev/tty0  /dev/tty1  /dev/tty2  /dev/tty3  /dev/tty4

or just

$ echo /dev/tty[0-4]
/dev/tty0 /dev/tty1 /dev/tty2 /dev/tty3 /dev/tty4
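If you need to act on each match rather than just list it, the same glob can drive a loop. The sketch below uses a scratch directory with made-up file names instead of the real /dev so that it is safe to run anywhere; the directory contents and variable names are assumptions for illustration.

```shell
# Build a scratch directory with hypothetical entries standing in for /dev.
dir=$(mktemp -d)
touch "$dir/tty0" "$dir/tty3" "$dir/tty5" "$dir/tty12" "$dir/ttyS0"

matched=''
for f in "$dir"/tty[0-4]; do
    # If nothing matches, the shell leaves the pattern unexpanded; skip that case.
    [ -e "$f" ] || continue
    # Strip the directory part and collect the bare name.
    matched="$matched${f##*/} "
done

echo "$matched"
rm -r "$dir"
```

Only tty0 and tty3 are collected here: the glob tty[0-4] matches exactly one character after tty, so tty12 and ttyS0 are left out without any regex.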

Related Solutions

Grep Options – Difference Between -e and -E

-e is strictly the flag for indicating the pattern you want to match against. -E controls whether you need to escape certain special characters.

man grep explains -E a bit more:

   Basic vs Extended Regular Expressions
   In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

   Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead, so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.

   GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an extension, but portable scripts should avoid it.
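A short sketch of how the two flags differ in practice, using made-up input lines (the input and variable names are assumptions for illustration). -e only introduces a pattern and can be repeated, while -E changes how characters such as | and { are interpreted.

```shell
input='cat
dog
a|b
xaax'

# Basic syntax: "|" is an ordinary character, so only the literal line "a|b" matches.
bre_pipe=$(printf '%s\n' "$input" | grep 'a|b')

# Extended syntax: "|" is alternation, so any line containing "a" or "b" matches.
ere_pipe=$(printf '%s\n' "$input" | grep -E 'a|b')

# -e just introduces a pattern; repeating it selects lines matching either pattern.
two_patterns=$(printf '%s\n' "$input" | grep -e 'cat' -e 'dog')

# The same interval ("a" twice in a row) in basic and in extended syntax.
bre_interval=$(printf '%s\n' "$input" | grep 'a\{2\}')
ere_interval=$(printf '%s\n' "$input" | grep -E 'a{2}')

printf '%s\n---\n%s\n---\n%s\n---\n%s\n---\n%s\n' \
    "$bre_pipe" "$ere_pipe" "$two_patterns" "$bre_interval" "$ere_interval"
```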

PCRE-regex Use grep to exclude a capturing group

grep's name comes from the g/re/p command of ed. Its primary purpose is to print the lines that match a regexp. It's not its role to edit the content of those lines. You have sed (the stream editor) or awk for that.

Now, some grep implementations, starting with GNU grep, added a -o option to print the matched portion of each line (what is matched by the regexp as a whole, not its capture groups). Some implementations, such as GNU grep again (with -P) or pcregrep, also support PCREs for their regexps.
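As a minimal illustration of that point about -o, the sketch below runs the same pattern with and without it on a made-up line (the input and variable names are assumptions). It relies on -o, which is widely available but not part of POSIX grep.

```shell
line='user=alice id=42'

# Without -o, grep prints the whole matching line.
whole=$(printf '%s\n' "$line" | grep -E 'id=([0-9]+)')

# With -o, only the matched text is printed; the parentheses do not narrow it to "42".
matched=$(printf '%s\n' "$line" | grep -oE 'id=([0-9]+)')

printf 'whole:   %s\nmatched: %s\n' "$whole" "$matched"
```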

pcregrep actually added a -o<n> option to print the content of a capture group. So you could do:

pcregrep -o1 -o2 --om-separator=' ' '\.zoo\.(\d+).*:\s+(.*)'

But here, the obvious standard solution is to use sed:

sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p'

Or if you want perl regexps, use perl:

perl -lne 'print "$1 $2" if /\.zoo\.(\d+).*:\s+(.*)/'
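To make the sed variant concrete, here is a sketch that runs it on a few made-up lines. The sample input is purely hypothetical (the file from the question is not shown here); only the sed expression is taken from the text above.

```shell
# Hypothetical input; the question's actual file is not reproduced here.
sample='foo.zoo.2.bar: 0.45654343
unrelated line
x.zoo.17:   hello world'

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded, so non-matching lines are dropped.
result=$(printf '%s\n' "$sample" |
    sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p')

printf '%s\n' "$result"
```

Each matching line is reduced to the number after .zoo. and the text after the colon, separated by a single space.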

With GNU grep, if you don't mind the matches to appear on different lines, you can do:

$ grep -Po '\.zoo\.\K\d+|:\s+\K.*' < file
2
0.45654343

Note that while \K resets the start of the matched portion, that doesn't mean you can get away with the two parts of the alternation overlapping.

grep -Po '\.zoo\.(\K\d+|.*: \K.*)'

would not work, just like echo foobar | grep -Po 'foo|foob' wouldn't work (at printing both foo and foob). foo|foob first matches foo, and then grep looks for potential other matches in the input after the foo, that is, starting at the b of bar, where it can't find any more.
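The foo|foob case is easy to try directly. This sketch assumes a grep with PCRE support (-P), such as GNU grep built with libpcre; the variable name is arbitrary.

```shell
# PCRE tries alternatives left to right, so "foo" wins at position 0.
# The search then resumes at "bar", where neither alternative matches.
out=$(echo foobar | grep -Po 'foo|foob')

printf '%s\n' "$out"
```

Only a single line, foo, is printed; foob never gets a chance because its starting point has already been consumed.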

Above with grep -Po '\.zoo\.\K\d+|:\s+\K.*', we only look for :<spaces><anything> in the second part of the alternation. That does match in the part that is after .zoo.<digits> but that also means it would find those :<spaces><anything> anywhere in the input, not only when they follow .zoo.<digits>.

There is a way to work around that though, using another PCRE special operator: \G. \G matches at the start of the subject. For a single match, that's equivalent to ^, but with multiple matches (think of sed/perl's g flag in s/.../.../g) like with -o where grep tries to find all the matches in the line, that also matches after the end of the previous match. So if you make it:

grep -Po '\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

Here (?!^) is a negative look-ahead operator that means "not at the beginning of the line", so that \G will only match after a previous successful (non-empty) match. As a result, .*:\s+\K.* will only match if it follows a previous successful match, and that can only be the .zoo.<digits> one, since the other part of the alternation matches until the end of the line.

On an input like:

.zoo.1.zoo.2 tar: blah

That would output:

1
2
blah

If you did not want both numbers to be printed, you'd also want the first part of the alternation to only match at the beginning of the line. Something like:

grep -Po '^.*?\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

That still outputs 2 on an input like .zoo.2 no colon character or .zoo.2 blah:. You could work around that with a look-ahead operator in the first part of the alternation, and by requiring at least one non-space after :<spaces> (also using $ to avoid issues with non-characters):

grep -Po '^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'

You'd probably need a few pages of comments to explain that regexp, so I would still go for the straightforward sed/perl solutions...
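For comparison, the \G variants above can be run side by side. This is a sketch that assumes a grep with PCRE support (-P), such as GNU grep built with libpcre; the variable names are arbitrary, and the inputs are the example lines already mentioned above.

```shell
# Requires a grep with PCRE support (-P).
input='.zoo.1.zoo.2 tar: blah'

# First variant: the first alternative may match anywhere in the line.
unanchored=$(printf '%s\n' "$input" |
    grep -Po '\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*')

# Second variant: the first alternative is tied to the start of the line.
anchored=$(printf '%s\n' "$input" |
    grep -Po '^.*?\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*')

# Final variant, on a line that has nothing after the colon.
strict=$(printf '%s\n' '.zoo.2 blah:' |
    grep -Po '^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$')

printf 'unanchored:\n%s\n\nanchored:\n%s\n\nstrict:\n%s\n' \
    "$unanchored" "$anchored" "$strict"
```

With GNU grep, the first variant should print 1, 2 and blah, the second only 1 and blah, and the final one nothing at all for the line without text after the colon.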
