Linux Grep – Regular Expression 'emm*[a-f].[^ta]$'

grep, linux, regular expression

I asked a similar question about regular expressions with the grep command an hour ago. Pardon me if it would have been preferable to post in the same thread; if so, I will do that next time.

It might seem like basic syntax, but I'm trying to understand how regular expression pattern matching works, and the results I get seem to contradict the manual I'm reading (I'm most likely not interpreting the material properly).

A file contains the following list of words:

mael@mael-HP:~/repertoireVide$ cat MySQLServ
remembré
emmuré
emmené
dilemmes
jumeaux
écrémage
emmena
emmailloter
flemmard

The following command gives this output:

mael@mael-HP:~/repertoireVide$ grep -r 'emm*[a-f].[^ta]$'
MySQLServ:remembré
MySQLServ:emmené
MySQLServ:flemmard

I'm wondering why grep does not match the word 'emmailloter', since 'emmailloter':

  1. contains 'em'
  2. contains a character from [a-f] afterwards: 'a'
  3. 'i' fulfills the '.' component
  4. does not end with either the character 't' or 'a'

Thanks.

Best Answer

The word emmailloter contains much more than i between the bits matched by [a-f] and [^ta]$. The . pattern only ever matches a single character, so if you want to match multiple characters between emma and r at the end, you will have to allow for multiple characters:

emm*[a-f]..*[^ta]$
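As a rough sketch of the difference, the two patterns can be compared on two words from the list that contain no accented characters. The variable names and the `printf` input are only for illustration; they are not part of the original setup.

```shell
# Two ASCII-only words from the list (illustrative input, not the real file).
words='emmailloter
flemmard'

# Original pattern: exactly one character between [a-f] and the final character.
original=$(printf '%s\n' "$words" | grep 'emm*[a-f].[^ta]$')

# Adjusted pattern: one or more characters between [a-f] and the final character.
fixed=$(printf '%s\n' "$words" | grep 'emm*[a-f]..*[^ta]$')

printf 'original pattern matched:\n%s\n' "$original"
printf 'adjusted pattern matched:\n%s\n' "$fixed"
```

With the original pattern only flemmard should be reported, because in emmailloter the third character from the end is t, which is not in [a-f].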

With grep -E (enabling extended regular expressions), ..* could be written .+, i.e. "match at least one character". The expression ..* reads as "match a character, and then possibly more characters". In the same way, emm* could be replaced by em+, i.e. "e followed by at least one m" if using grep -E.
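A small sketch to check that the basic and extended spellings behave the same, assuming a handful of ASCII-only words from the list as input (the variable names are made up for this example):

```shell
# Illustrative input: words from the list without accented characters.
words='emmena
emmailloter
flemmard
jumeaux'

# Basic regular expression: "one or more" spelled as x followed by x*.
bre=$(printf '%s\n' "$words" | grep 'emm*[a-f]..*[^ta]$')

# Extended regular expression: the same thing spelled with +.
ere=$(printf '%s\n' "$words" | grep -E 'em+[a-f].+[^ta]$')

# Both spellings should select the same lines.
[ "$bre" = "$ere" ] && echo 'same lines matched'
printf '%s\n' "$ere"
```

Here emmena is rejected by both forms because it ends in a, and jumeaux because it never contains em.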

This would match the string

blop-emmmmmmmmma-blarg-b
     ^^^^^^^^^^^^^^^^^^^
     1111111111233333334

1: emm*
2: [a-f]
3: ..*
4: [^ta]$

(the matching part is indicated by the ^ characters above), for example, and also emmailloter:

emmailloter
^^^^^^^^^^^
11123333334
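If your grep supports the -o option (a GNU and BSD extension, not part of POSIX), the matched portion can also be printed directly. This is only a sketch; the variable names are chosen for the example.

```shell
# -o prints only the text that matched, not the whole line
# (assumption: GNU or BSD grep; -o is not specified by POSIX).
input='blop-emmmmmmmmma-blarg-b'
matched=$(printf '%s\n' "$input" | grep -o 'emm*[a-f]..*[^ta]$')
printf '%s\n' "$matched"      # the line without its leading "blop-"

word_match=$(printf '%s\n' emmailloter | grep -o 'emm*[a-f]..*[^ta]$')
printf '%s\n' "$word_match"   # the whole word
```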

Testing:

$ grep -E 'emm*[a-f].+[^ta]$' MySQLServ
remembré
emmené
emmailloter
flemmard

Note that for the word remembré, the match starts at the earliest possible position in the line, so it will be

remembré
 ^^^^^^^
 1123334

not

remembré
   ^^^^^
   11234

One way to visualise the matches using sed:

$ sed -n -E 's/(emm*)([a-f])(.+)([^ta]$)/(\1)(\2)(\3)(\4)/p' MySQLServ
r(em)(e)(mbr)(é)
(emm)(e)(n)(é)
(emm)(a)(illote)(r)
fl(emm)(a)(r)(d)

This will only print matching lines, with each matched part of the regular expression in parentheses. This also assumes that you are using a sed implementation that can handle accented (non-ASCII) characters such as é, and that the locale environment variables are properly set up for that.

Compare this with what your original expression produces:

$ sed -n -E 's/(emm*)([a-f])(.)([^ta]$)/(\1)(\2)(\3)(\4)/p' MySQLServ
rem(em)(b)(r)(é)
(emm)(e)(n)(é)
fl(emm)(a)(r)(d)