I wrote a regular expression which works well in a certain program (grep, sed, awk, perl, python, ruby, ksh, bash, zsh, find, emacs, vi, vim, gedit, …). But when I use it in a different program (or on a different unix variant), it stops matching. Why?
Why Regular Expressions Work Differently Across Platforms
Best Answer
Unfortunately, for historical reasons, different tools have slightly different regular expression syntax, and sometimes some implementations have extensions that are not supported by other tools. While there is a common ground, it seems like every tool writer made some different choices.
The consequence is that if you have a regular expression that works in one tool, you may need to modify it to work in another tool. The main differences between common tools are:

- whether some of the operators `+?|(){}` require a backslash;
- which extensions are supported beyond the basics `.[]*^$` and usually `+?|()`.
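To make the first difference concrete, here is a minimal sketch that feeds the same pattern to `grep` (BRE) and to `grep -E` (ERE). The two sample lines are made up for illustration, and the behaviour described in the comments is what POSIX specifies for an unescaped `+`; a particular implementation may still differ.

```shell
# Two sample lines: one with repeated b's, one containing a literal plus sign.
sample='abbc
ab+c'

# BRE: an unescaped + is an ordinary character, so only the line "ab+c" matches.
bre_match=$(printf '%s\n' "$sample" | grep 'ab+c')

# ERE: + means "one or more", so "abbc" matches and "ab+c" does not.
ere_match=$(printf '%s\n' "$sample" | grep -E 'ab+c')

# The standard BRE spelling of "one or more" is the interval \{1,\}.
bre_interval=$(printf '%s\n' "$sample" | grep 'ab\{1,\}c')

printf 'grep    ab+c       -> %s\n' "$bre_match"
printf 'grep -E ab+c       -> %s\n' "$ere_match"
printf 'grep    ab\\{1,\\}c  -> %s\n' "$bre_interval"
```

The pattern text is identical in the first two calls; only the syntax flavour changes, which is exactly why a regex copied from one tool can silently stop matching in another.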
In this answer, I list the main standards. Check the documentation of the tools you're using for the details.
Wikipedia's comparison of regular expression engines has a table listing the features supported by common implementations.
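When the documentation is not at hand, a quick experiment can show what a given tool accepts. The sketch below is only an illustration: `probe` is a hypothetical helper name, not a standard utility, and it reports `yes` when the command finds a match and `no` when it finds none or rejects the pattern. A `no` therefore does not distinguish "unsupported" from "supported with a different meaning".

```shell
# probe INPUT COMMAND [ARGS...]
# Feed INPUT to COMMAND and report whether it exited successfully (found a match).
probe() {
    input=$1
    shift
    if printf '%s\n' "$input" | "$@" >/dev/null 2>&1; then
        echo yes
    else
        echo no
    fi
}

# Operators that POSIX standardizes: a conforming grep should accept these.
probe 'aaa'  grep    'a\{3\}'      # BRE interval
probe 'abab' grep    '\(ab\)\1'    # BRE group and backreference
probe 'foo'  grep -E 'foo|bar'     # ERE alternation

# Backslashed operators in BRE are extensions: the answer depends on the tool.
probe 'foo'  grep    'foo\|bar'
probe 'aaa'  grep    'a\+'
```

Running the same probes on each system you care about gives a rough compatibility table for your own patterns, though it is no substitute for the tool's manual.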
Basic regular expressions (BRE)
Basic regular expressions are codified by the POSIX standard. It is the syntax used by `grep`, `sed` and `vi`. This syntax provides the following features:

- `^` and `$` match only at the beginning and end of a line.
- `.` matches any character (or any character except a newline).
- `[…]` matches any one character listed inside the brackets (character set). If the first character after the opening bracket is a `^`, the characters which are not listed are matched instead. To include a `]`, put it immediately after the opening `[` (or after `[^` if it's a negative set). If `-` is between two characters, it denotes a range; to include a literal `-`, put it where it can't be parsed as a range.
- A backslash before any of `^$.*\[` quotes the next character.
- `*` matches the preceding character or subexpression 0, 1 or more times.
- `\(…\)` is a syntactic group, for use with the `*` operator or backreferences and `\DIGIT` replacements.
- Backreferences `\1`, `\2`, … match the exact text matched by the corresponding group, e.g. `\(fo*\)\(ba*\)\1` matches `foobaafoo` but not `foobaafo`. There is no standard way to refer to the 10th group and beyond (the standard meaning of `\10` is the first group followed by a `0`).

The following features are also standard, but missing from some restricted implementations:
- `\{m,n\}` matches the preceding character or subexpression between m and n times; n or m can be omitted, and `\{m\}` means exactly m.
- Character classes inside brackets: `[[:alpha:]]` matches any letter. Modern implementations of bracket expressions also include collating elements like `[.ll.]` and equivalence classes like `[=a=]`.

The following are common extensions (especially in GNU tools), but they are not found in all implementations. Check the manual of the tool you're using.
- `\|` for alternation: `foo\|bar` matches `foo` or `bar`.
- `\?` (short for `\{0,1\}`) and `\+` (short for `\{1,\}`) match the preceding character or subexpression at most 1 time, or at least 1 time respectively.
- `\n` matches a newline, `\t` matches a tab, etc.
- `\w` matches any word constituent (short for `[_[:alnum:]]` but with variation when it comes to localisation) and `\W` matches any character that isn't a word constituent.
- `\<` and `\>` match the empty string only at the beginning or end of a word respectively; `\b` matches either, and `\B` matches where `\b` doesn't.

Note that tools without the `\|` operator do not have the full power of regular expressions. Backreferences allow a few extra things that can't be done with regular expressions in the mathematical sense.

Extended regular expressions (ERE)
Extended regular expressions are codified by the POSIX standard. Their major advantage over BRE is regularity: all standard operators are bare punctuation characters, and a backslash before a punctuation character always quotes it. It is the syntax used by `awk`, `grep -E` or `egrep`, GNU `sed -r`, and bash's `=~` operator. This syntax provides the following features:

- `^` and `$` match only at the beginning and end of a line.
- `.` matches any character (or any character except a newline).
- `[…]` matches any one character listed inside the brackets (character set). Complementation with an initial `^` and ranges work like in BRE (see above). Character classes can be used but are missing from a few implementations. Modern implementations also support equivalence classes and collating elements. A backslash inside brackets quotes the next character in some but not all implementations; use `\\` to mean a backslash for portability.
- `(…)` is a syntactic group, for use with `*` or `\DIGIT` replacements.
- `|` for alternation: `foo|bar` matches `foo` or `bar`.
- `*`, `+` and `?` match the preceding character or subexpression a number of times: 0 or more for `*`, 1 or more for `+`, 0 or 1 for `?`.
- `{m,n}` matches the preceding character or subexpression between m and n times (missing from some implementations); n or m can be omitted, and `{m}` means exactly m.
- Some common extensions, as in BRE: `\DIGIT` backreferences (notably absent in awk except in the busybox implementation where you can use `$0 ~ "(...)\\1"`); special characters `\n`, `\t`, etc.; word boundaries `\b` and `\B`, word constituents `\w` and `\W`, …

PCRE (Perl-compatible regular expressions)
PCRE are extensions of ERE, originally introduced by Perl and adopted by GNU `grep -P` and many modern tools and programming languages, usually via the PCRE library. See the Perl documentation for nice formatting with examples. Not all features of the latest version of Perl are supported by PCRE (e.g. Perl code execution is only supported in Perl). See the PCRE manual for a summary of supported features. The main additions to ERE are:

- `(?:…)` is a non-capturing group: like `(…)`, but does not count for backreferences.
- `(?=FOO)BAR` (lookahead) matches `BAR`, but only if there is also a match for `FOO` starting at the same position. This is most useful to anchor a match without including the following text in the match: `foo(?=bar)` matches `foo` but only if it's followed by `bar`.
- `(?!FOO)BAR` (negative lookahead) matches `BAR`, but only if there is not also a match for `FOO` at the same position. For example `(?!foo)[a-z]+` matches any lowercase word that does not start with `foo`; `[a-z]+(?![0-9])` matches any lowercase word that is not followed by a digit (so in `foo123`, it matches `fo` but not `foo`).
- `(?<=FOO)BAR` (lookbehind) matches `BAR`, but only if it is immediately preceded by a match for `FOO`. `FOO` must have a known length (you can't use repetition operators such as `*`). This is most useful to anchor a match without including the preceding text in the match: `(?<=^| )foo` matches `foo` but only if it's preceded by a space or by the beginning of the string.
- `(?<!FOO)BAR` (negative lookbehind) matches `BAR`, but only if it is not immediately preceded by a match for `FOO`. `FOO` must have a known length (you can't use repetition operators such as `*`). This is most useful to anchor a match without including the preceding text in the match: `(?<![a-z])foo` matches `foo` but only if it is not preceded by a lowercase letter.

Emacs
Emacs's syntax is intermediate between BRE and ERE. In addition to Emacs, it is the default syntax for `-regex` in GNU find. Emacs offers the following operators:

- `^`, `$`, `.`, `[…]`, `*`, `+`, `?` as in ERE
- `\(…\)`, `\|`, `\{…\}`, `\DIGIT` as in BRE
- `\<` and `\>` for word boundaries; and more in recent versions of Emacs, which are often not supported in other engines with an Emacs-like syntax.

Shell globs
Shell globs (wildcards) perform pattern matching with a syntax that is completely different from regular expressions and less powerful. In addition to shells, these wildcards are available with other tools such as `find -name` and rsync filters. POSIX patterns include the following features:

- `?` matches any single character.
- `[…]` is a character set as in common regular expression syntaxes. Some shells do not support character classes. Some shells require `!` instead of `^` to negate the set.
- `*` matches any sequence of characters (often except `/` when matching file paths; if `/` is excluded from `*`, then `**` sometimes includes `/`, but check the tool's documentation).

Ksh offers additional features which give its pattern matching the full power of regular expressions. These features are also available in bash after running `shopt -s extglob`. Zsh has a different syntax but can also support ksh's syntax after `setopt ksh_glob`.
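As a rough illustration of the difference between plain globs and the ksh-style extensions, here is a small bash sketch. It assumes bash (not plain `sh`), since `shopt -s extglob` is a bash option. The function name `classify` and the sample strings are invented for this example.

```shell
#!/usr/bin/env bash
# extglob must be enabled before bash parses any pattern that uses it.
shopt -s extglob

# classify NAME: report which of two patterns, if any, matches the whole string.
classify() {
    case $1 in
        # Plain POSIX glob: ? is one character, [0-9] a set, * any sequence.
        log-?-[0-9]*)   echo "posix glob" ;;
        # ksh-style extended glob: +(…) means one or more of the alternatives,
        # roughly what the ERE ^(ab|cd)+\.txt$ expresses.
        +(ab|cd).txt)   echo "extended glob" ;;
        *)              echo "no match" ;;
    esac
}

classify log-a-2024     # prints "posix glob"
classify abcdab.txt     # prints "extended glob"
classify abx.txt        # prints "no match"
```

Unlike a regular expression given to `grep`, a glob always has to match the entire string, which is why no anchors appear in the patterns.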