Linux Grep – Regular Expression 'emm*[a-f].[^ta]$'

grep, linux, regular expression

I asked a similar question about regular expressions with the grep command an hour ago. Pardon me if I should have posted in the same thread; if so, I will do that next time.

It might seem like basic syntax, but I'm trying to understand how regular expression pattern matching works, and the results I get seem to contradict the manual I'm reading (most likely I'm not interpreting the material properly).

A file contains the following list of words:

mael@mael-HP:~/repertoireVide$ cat MySQLServ
remembré
emmuré
emmené
dilemmes
jumeaux
écrémage
emmena
emmailloter
flemmard

The following command gives this output:

mael@mael-HP:~/repertoireVide$ grep -r 'emm*[a-f].[^ta]$'
MySQLServ:remembré
MySQLServ:emmené
MySQLServ:flemmard

I'm wondering why grep is not matching the word 'emmailloter', since 'emmailloter':

contains 'em'
contains a character in [a-f] afterwards: 'a'
'i' fulfills the '.' component
does not end with either the character 't' or 'a'

Thanks.

Best Answer

The word emmailloter contains much more than i between the bits matched by [a-f] and [^ta]$. The . pattern only ever matches a single character, so if you want to match multiple characters between emma and r at the end, you will have to allow for multiple characters:

emm*[a-f]..*[^ta]$

With grep -E (enabling extended regular expressions), ..* could be written .+, i.e. "match at least one character". The expression ..* reads as "match a character, and then possibly more characters". In the same way, emm* could be replaced by em+, i.e. "e followed by at least one m" if using grep -E.

This would match the string

blop-emmmmmmmmma-blarg-b
     ^^^^^^^^^^^^^^^^^^^
     1111111111233333334

1: emm*
2: [a-f]
3: ..*
4: [^ta]$

(the matching part indicated by the ^ characters above), for example, and also emmailloter:

emmailloter
^^^^^^^^^^^
11123333334

Testing:

$ grep -E 'emm*[a-f].+[^ta]$' MySQLServ
remembré
emmené
emmailloter
flemmard
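
As a quick cross-check, the basic and extended forms of the expression should select the same lines. The sketch below assumes a hypothetical ASCII-only subset of the word list (the accented words are left out so the result does not depend on locale settings) and feeds it to grep on standard input rather than from the MySQLServ file:

```shell
# Hypothetical ASCII-only sample: a subset of the original word list.
words='dilemmes
jumeaux
emmena
emmailloter
flemmard'

# Basic regular expression: "one or more" is spelled as xx*.
bre=$(printf '%s\n' "$words" | grep 'emm*[a-f]..*[^ta]$')

# Extended regular expression: "one or more" is spelled as x+.
ere=$(printf '%s\n' "$words" | grep -E 'em+[a-f].+[^ta]$')

printf 'BRE:\n%s\n' "$bre"
printf 'ERE:\n%s\n' "$ere"
```

Both variables should end up holding emmailloter and flemmard. emmena is rejected because it ends in a, and dilemmes because only one character follows the e that matches [a-f].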

Note that for the word remembré, the match will be

remembré
 ^^^^^^^
 1123334

not

remembré
   ^^^^^
   11234
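
This leftmost-match behaviour can also be checked with grep -o, which prints only the matched part of each line. The option is supported by GNU and BSD grep but is not required by POSIX, so treat this as a sketch. It uses the unaccented spelling remembre as a stand-in for the real word, an assumption made only to keep the example independent of the locale:

```shell
# Unaccented stand-in for "remembré", used to stay locale-independent.
word='remembre'

# -o prints only the part of the line that the expression matched.
matched=$(printf '%s\n' "$word" | grep -oE 'emm*[a-f].+[^ta]$')

printf '%s\n' "$matched"
```

The match starts at the first e, so this should print emembre rather than embre.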

One way to visualise the matches using sed:

$ sed -n -E 's/(emm*)([a-f])(.+)([^ta]$)/(\1)(\2)(\3)(\4)/p' MySQLServ
r(em)(e)(mbr)(é)
(emm)(e)(n)(é)
(emm)(a)(illote)(r)
fl(emm)(a)(r)(d)

This will only print matching lines, with each matched part of the regular expression in parentheses. This also assumes that you are using a sed implementation that can be used to match French characters and that the locale environment variables are properly set up for doing that.

Compare this with what your original expression produces:

$ sed -n -E 's/(emm*)([a-f])(.)([^ta]$)/(\1)(\2)(\3)(\4)/p' MySQLServ
rem(em)(b)(r)(é)
(emm)(e)(n)(é)
fl(emm)(a)(r)(d)

Related Solutions

bash grep – Why Quote Escaped Character in Regex

Why? Because your shell interprets some special characters, such as \ in your example.

You are running into trouble because you do not protect the string that you pass as an argument to grep via the shell.

Several solutions:

single-quoting the string,
double-quoting the string (with double quotes the shell still interprets several things, such as $variables, before sending the resulting string to the command),
or not quoting at all (which I strongly advise against) and adding backslashes in the right places to prevent the shell from interpreting the following characters before sending them to the command.

I recommend protecting the string with single quotes, as this keeps almost everything literal:

grep '9\.0' #send those 4 characters to grep in a single argument

The shell passes the single-quoted string literally.

Note: The only thing you can't include inside a single-quoted shell string is a single quote (as this ends the single-quoting). To include a single quote inside a single-quoted shell string, you need to first end the single-quoting, immediately add an escaped single quote \' (or one between double quotes: "'"), and then immediately reenter the single-quoting to continue the string. For example, to have the shell execute the command grep a'b, you could write the parameter as 'a'\''b' so that the shell sends a'b to grep. So write: grep 'a'\''b' , or grep 'a'"'"'b'
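
A small sketch of that single-quote trick, using printf instead of grep so that the argument the shell actually builds can be seen. The variable names are only for illustration:

```shell
# End the single-quoted part, add an escaped quote, then reopen the quoting.
escaped=$(printf '%s' 'a'\''b')

# Same idea, but the lone quote is protected by double quotes instead.
dquoted=$(printf '%s' 'a'"'"'b')

# Each line printed is the single argument the command received.
printf '%s\n' "$escaped" "$dquoted"
```

Both forms should print a'b, which is the string grep would receive as its pattern.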

If you insist on not using quoting, you need to give your shell a \\ to have it send a \ to grep.

grep 9\\.0  # i.e. a 9, a pair \\, a ., and a 0; the shell turns the pair \\ into a literal \

If you use double quotes, keep in mind that the shell interprets several things first ($vars, \, etc.). Outside quotes, a \ makes the next character literal, so \w is passed as the single letter w and \\ as a single \. Inside double quotes, \ keeps this special meaning only before $, `, ", \ and a newline; before any other character it is passed through unchanged.

grep "9\\.0"  # looks the same here as not quoting at all...
    # but double-quoting allows you to have spaces, etc., inside the string
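
To see what each quoting style hands to the command, the argument can be printed with printf before trying it with grep. This is a minimal sketch; the two-line sample (9.0 and 940) is a made-up input chosen so that an unprotected dot makes a visible difference:

```shell
# What the shell passes on for each quoting style.
single=$(printf '%s' '9\.0')    # single quotes: passed literally
bare=$(printf '%s' 9\\.0)       # unquoted: the pair \\ becomes one \
double=$(printf '%s' "9\\.0")   # double quotes: the pair \\ becomes one \
naked=$(printf '%s' 9\.0)       # unquoted single \: the shell removes it

printf '%s\n' "$single" "$bare" "$double" "$naked"

# Hypothetical sample lines: only the first contains a literal dot.
sample='9.0
940'

# Quoted: grep receives 9\.0 and the dot is matched literally.
quoted_hits=$(printf '%s\n' "$sample" | grep '9\.0')

# Unquoted: grep receives 9.0 and the dot matches any character.
naked_hits=$(printf '%s\n' "$sample" | grep 9\.0)

printf 'quoted:\n%s\n' "$quoted_hits"
printf 'unquoted:\n%s\n' "$naked_hits"
```

The first three forms should all yield 9\.0, while the unprotected one yields 9.0 and therefore also matches 940.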
