Shell – Email Matching Regex for Grep

command line, grep, regular expression, shell

I created a text file and put some email addresses in it. Then I used grep to find them. Indeed it worked:

# pattern="^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]{2,}"
# grep -E $pattern regexfile

but only as long as I kept the -E option for extended regular expressions. How do I need to change the regex above in order to use grep without the -E option?

Best Answer

Be aware that matching email addresses is a LOT harder than what you have. See an excerpt from the Mastering Regular Expressions book.

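However, to answer your question: in a basic regular expression, your quantifiers need to be one of *, \+ or \{m,n\} (with the backslashes). Note that \+ is a GNU extension; the strictly POSIX way to write "one or more" in a basic regular expression is \{1,\}.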

pattern='^[a-zA-Z0-9]\+@[a-zA-Z0-9]\+\.[a-z]\{2,\}'
grep "$pattern" regexfile

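You also need to quote the pattern variable when you expand it ("$pattern"); otherwise the shell applies word splitting and filename expansion to the pattern before grep sees it.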
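As a quick check, the sketch below builds a throwaway sample file (the addresses and the temporary file are made up for illustration) and runs the same match twice: once as an extended regular expression with -E, and once as a basic regular expression written with the POSIX interval \{1,\} instead of \+. It assumes mktemp is available.

```shell
# Hypothetical sample data; any file with one address per line works.
sample=$(mktemp)
printf '%s\n' 'alice@example.com' 'bob42@mail.org' 'not an address' 'x@y.z' >"$sample"

# Extended syntax (needs -E) and the equivalent basic syntax (no option needed).
ere='^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]{2,}'
bre='^[a-zA-Z0-9]\{1,\}@[a-zA-Z0-9]\{1,\}\.[a-z]\{2,\}'

ere_hits=$(grep -E -- "$ere" "$sample")
bre_hits=$(grep -- "$bre" "$sample")

printf '%s\n' "$bre_hits"
rm -f "$sample"
```

Both commands should select the same lines. The line x@y.z is rejected because the pattern requires at least two letters after the dot.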

How to quote special characters (portably)

The following snippet adds a backslash before each character that's special in extended regular expressions, using sed to replace any occurrence of one of the characters ][()\.^$?*+ by a backslash followed by that character:

raw_string='test[string]\.wibble'
quoted_string=$(printf %s "$raw_string" | sed 's/[][()\.^$?*+]/\\&/g')

Command substitution strips trailing newlines, so any newlines at the end of $raw_string are lost. If that's a problem, append an inert character so that the output never ends with a newline, then strip that character off afterwards:

quoted_string=$(printf %sa "$raw_string" | sed 's/[][()\.^$?*+]/\\&/g')
quoted_string=${quoted_string%?}
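To see why the quoting matters, here is a small sketch that feeds two made-up candidate lines to grep -E, once with the raw string as the pattern and once with the quoted string. The variable names candidates, unquoted_hits and quoted_hits are only for this illustration.

```shell
raw_string='test[string]\.wibble'
quoted_string=$(printf %s "$raw_string" | sed 's/[][()\.^$?*+]/\\&/g')

# Two candidate lines: the literal text, and a line that the *unquoted*
# string happens to match when it is read as a regex.
candidates=$(printf '%s\n' 'test[string]\.wibble' 'tests.wibble')

# Unquoted: [string] is a bracket expression and \. is a literal dot.
unquoted_hits=$(printf '%s\n' "$candidates" | grep -E -- "$raw_string")
# Quoted: every special character is escaped, so only the literal text matches.
quoted_hits=$(printf '%s\n' "$candidates" | grep -E -- "$quoted_string")

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted_hits" "$quoted_hits"
```

If the whole pattern is a literal string, grep -F avoids the need for quoting altogether; the sed approach is mainly useful when the literal has to be embedded in a larger regex.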

How to quote special characters (in bash or zsh)

Bash and zsh have a pattern replacement feature, which can be faster if the string is not very long. It's cumbersome here because the replacement must be a string, so each character needs to be replaced separately. Note that you must escape the backslashes first; otherwise the backslashes inserted for the other characters would themselves get doubled.

quoted_string=${raw_string//\\/\\\\}
for c in \[ \] \( \) \. \^ \$ \? \* \+; do
  quoted_string=${quoted_string//"$c"/"\\$c"}
done

How to quote special characters (in ksh93)

Ksh's string replacement construct is more powerful than the watered-down version in bash and zsh. It supports references to groups in the pattern.

quoted_string=${raw_string//@([][()\.^$?*+])/\\\1}

What you actually want

You don't need find here: shell patterns are sufficient to match files ending with three digits. If no part file exists, the glob pattern is left unexpanded. There's also a simpler way of adding the file sizes: rather than use stat (which exists on many unix variants but has a different syntax on each) and do complex pipelining to sum the values, you can call wc -c (on regular files, on most systems, wc will look at the file size and not bother to open the file and read the bytes).

set -- "$DESTINATION/$FILE_BASENAME".[0-9][0-9][0-9]
case $1 in
  *\]) # The glob was left intact, so no part exists
    do_split …;;
  *) # The glob was expanded, so at least one part exists
    FILE_SIZE_EXISTING=$(wc -c "$@" | sed -n '$s/[^0-9]//gp')
    if [ "$FILE_SIZE_EXISTING" -ne "$(wc -c <"$DESTINATION/$FILE_BASENAME")" ]; then
      do_split …
    fi;;
esac

Note that your test on the total size is not very reliable: if the file has changed but remained the same size, you'll end up with stale parts. That's ok if the files never change and the only risk is that parts may be truncated or missing.
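A minimal sketch of the size comparison, using made-up file names in a temporary directory. It sums the part sizes with cat piped into wc -c rather than parsing the last line of wc -c "$@"; this is a swapped-in variant, chosen because wc prints no total line when it is given a single file. The trade-off is that it reads the contents of every part, so it is slower on large files. It assumes mktemp -d is available.

```shell
# Hypothetical layout: one original file plus three numbered parts.
dir=$(mktemp -d)
printf 'abcdefghij' >"$dir/data.bin"
printf 'abcd' >"$dir/data.bin.000"
printf 'efgh' >"$dir/data.bin.001"
printf 'ij'   >"$dir/data.bin.002"

set -- "$dir/data.bin".[0-9][0-9][0-9]

# $(( ... )) strips the leading blanks that some wc implementations print.
parts_size=$(( $(cat "$@" | wc -c) ))
orig_size=$(( $(wc -c <"$dir/data.bin") ))

if [ "$parts_size" -eq "$orig_size" ]; then
  status=complete
else
  status=incomplete
fi
printf '%s\n' "$status"
rm -rf "$dir"
```

As with the original test, equal sizes only show that no part is missing or truncated, not that the parts are up to date.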

Grep Options – Difference Between -e and -E

-e is strictly the flag for indicating the pattern you want to match against. -E controls whether you need to escape certain special characters.

man grep explains -E a bit more:

   Basic vs Extended Regular Expressions
   In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

   Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead, so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.

   GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an extension, but portable scripts should avoid it.
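The two options are independent and can be combined. A short sketch with made-up input lines: -e marks its argument as a pattern (useful when the pattern starts with a dash, or when several patterns are given), while -E changes how characters such as + are interpreted.

```shell
input=$(printf '%s\n' '-v flag' 'foo+' 'foooo' 'bar')

# -e: the pattern begins with "-", so without -e grep would parse it as an option.
dash_hits=$(printf '%s\n' "$input" | grep -e '-v')

# -e given twice: a line is selected if either pattern matches.
either_hits=$(printf '%s\n' "$input" | grep -e 'bar' -e '-v')

# Without -E, "+" is an ordinary character; with -E it means "one or more".
bre_hits=$(printf '%s\n' "$input" | grep -e 'foo+')
ere_hits=$(printf '%s\n' "$input" | grep -E -e 'foo+')

printf 'basic:    %s\n' "$bre_hits"
printf 'extended: %s\n' $ere_hits
```

In the basic case only the line containing a literal plus sign is selected; in the extended case both foo+ and foooo are selected.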
