Why Regular Expressions Work Differently Across Platforms


I wrote a regular expression which works well in a certain program (grep, sed, awk, perl, python, ruby, ksh, bash, zsh, find, emacs, vi, vim, gedit, …). But when I use it in a different program (or on a different unix variant), it stops matching. Why?

Best Answer

Unfortunately, for historical reasons, different tools have slightly different regular expression syntax, and sometimes some implementations have extensions that are not supported by other tools. While there is a common ground, it seems like every tool writer made some different choices.

The consequence is that if you have a regular expression that works in one tool, you may need to modify it to work in another tool. The main differences between common tools are:

whether the operators +?|(){} require a backslash;
what extensions are supported beyond the basics .[]*^$ and usually +?|()

In this answer, I list the main standards. Check the documentation of the tools you're using for the details.

Wikipedia's comparison of regular expression engines has a table listing the features supported by common implementations.

Basic regular expressions (BRE)

Basic regular expressions are codified by the POSIX standard. It is the syntax used by grep, sed and vi. This syntax provides the following features:

^ and $ match only at the beginning and end of a line.
. matches any character (or any character except a newline).
[…] matches any one character listed inside the brackets (character set). If the first character after the opening bracket is a ^, the characters which are not listed are matched instead. To include a ], put it immediately after the opening [ (or after [^ if it's a negative set). If - is between two characters, it denotes a range; to include a literal -, put it where it can't be parsed as a range.
Backslash before any of ^$.*\[ quotes the next character.
* matches the preceding character or subexpression 0, 1 or more times.
\(…\) is a syntactic group, for use with the * operator or backreferences and \DIGIT replacements.
Backreferences \1, \2, … match the exact text matched by the corresponding group, e.g. \(fo*\)\(ba*\)\1 matches foobaafoo but not foobaafo. There is no standard way to refer to the 10th group and beyond (the standard meaning of \10 is the first group followed by a 0).
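
As a rough illustration of the backreference item above, here is a minimal sketch using Python's re module. Python is an assumption here, not one of the tools discussed; its syntax writes groups as bare (…) rather than with backslashes, but \1 behaves the same way for this pattern.

```python
import re

# The BRE backreference example, written with Python's unescaped parentheses.
pattern = re.compile(r"(fo*)(ba*)\1")

for text in ("foobaafoo", "foobaafo"):
    # fullmatch requires the whole string to match the pattern.
    matched = pattern.fullmatch(text) is not None
    print(text, matched)
```

The first string matches because the text after "baa" repeats exactly what the first group matched ("foo"); the second string ends in "fo", which is not the same text.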

The following features are also standard, but missing from some restricted implementations:

\{m,n\} matches the preceding character or subexpression between m and n times; n or m can be omitted, and \{m\} means exactly m.
Inside brackets, character classes can be used, for example [[:alpha:]] matches any letter. Modern implementations of bracket expressions also include collating elements like [.ll.] and equivalence classes like [=a=].

The following are common extensions (especially in GNU tools), but they are not found in all implementations. Check the manual of the tool you're using.

\| for alternation: foo\|bar matches foo or bar.
\? (short for \{0,1\}) and \+ (short for \{1,\}) match the preceding character or subexpression at most 1 time, or at least 1 time respectively.
\n matches a newline, \t matches a tab, etc.
\w matches any word constituent (short for [_[:alnum:]] but with variation when it comes to localisation) and \W matches any character that isn't a word constituent.
\< and \> match the empty string only at the beginning or end of a word respectively; \b matches either, and \B matches where \b doesn't.

Note that tools without the \| operator do not have the full power of regular expressions. Backreferences allow a few extra things that can't be done with regular expressions in the mathematical sense.

Extended regular expressions (ERE)

Extended regular expressions are codified by the POSIX standard. Their major advantage over BRE is regularity: all standard operators are bare punctuation characters, and a backslash before a punctuation character always quotes it. It is the syntax used by awk, grep -E or egrep, GNU sed -r, and bash's =~ operator. This syntax provides the following features:

^ and $ match only at the beginning and end of a line.
. matches any character (or any character except a newline).
[…] matches any one character listed inside the brackets (character set). Complementation with an initial ^ and ranges work like in BRE (see above). Character classes can be used but are missing from a few implementations. Modern implementations also support equivalence classes and collating elements. A backslash inside brackets quotes the next character in some but not all implementations; use \\ to mean a backslash for portability.
(…) is a syntactic group, for use with * or \DIGIT replacements.
| for alternation: foo|bar matches foo or bar.
*, + and ? match the preceding character or subexpression a number of times: 0 or more for *, 1 or more for +, 0 or 1 for ?.
Backslash quotes the next character if it is not alphanumeric.
{m,n} matches the preceding character or subexpression between m and n times (missing from some implementations); n or m can be omitted, and {m} means exactly m.
Some common extensions as in BRE: \DIGIT backreferences (notably absent in awk except in the busybox implementation where you can use $0 ~ "(...)\\1"); special characters \n, \t, etc.; word boundaries \b and \B, word constituents \w and \W, …
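
The main practical difference between BRE and ERE, namely which of +?|(){} need a backslash, can be sketched as a small converter. The function name bre_to_ere is hypothetical, and the sketch only flips the backslash on those seven characters (treating \+, \? and \| as the GNU-style extensions described above); it copies bracket expressions unchanged and makes no attempt to handle every corner of either syntax.

```python
SWAPPED = set("+?|(){}")  # operators whose backslash usage differs


def bre_to_ere(pattern: str) -> str:
    """Hypothetical helper: flip backslash usage on +?|(){} (sketch only)."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "[":
            # Copy a bracket expression verbatim; nothing is swapped inside.
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1  # a leading ] is a literal member of the set
            while j < n and pattern[j] != "]":
                if pattern[j] == "[" and j + 1 < n and pattern[j + 1] in ":.=":
                    # Skip over [:class:], [.coll.] or [=equiv=].
                    end = pattern.find(pattern[j + 1] + "]", j + 2)
                    j = end + 2 if end != -1 else n
                else:
                    j += 1
            out.append(pattern[i:j + 1])
            i = j + 1
        elif c == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            # \( becomes (, while \. or \1 stay as they are.
            out.append(nxt if nxt in SWAPPED else c + nxt)
            i += 2
        elif c in SWAPPED:
            # A bare + or ( is literal in BRE, so quote it for ERE.
            out.append("\\" + c)
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(bre_to_ere(r"\(fo*\)\(ba*\)\1"))  # (fo*)(ba*)\1
print(bre_to_ere(r"foo\|bar"))          # foo|bar
print(bre_to_ere(r"a+b"))               # a\+b
```

A real migration still needs the target tool's manual, since extensions such as \w or \< are not covered by this swap.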

PCRE (Perl-compatible regular expressions)

PCRE are extensions of ERE, originally introduced by Perl and adopted by GNU grep -P and many modern tools and programming languages, usually via the PCRE library. See the Perl documentation for nice formatting with examples. Not all features of the latest version of Perl are supported by PCRE (e.g. Perl code execution is only supported in Perl). See the PCRE manual for a summary of supported features. The main additions to ERE are:

(?:…) is a non-capturing group: like (…), but does not count for backreferences.
(?=FOO)BAR (lookahead) matches BAR, but only if there is also a match for FOO starting at the same position. This is most useful to anchor a match without including the following text in the match: foo(?=bar) matches foo but only if it's followed by bar.
(?!FOO)BAR (negative lookahead) matches BAR, but only if there is not also a match for FOO at the same position. For example (?!foo)[a-z]+ matches any lowercase word that does not start with foo; [a-z]+(?![0-9]) matches any lowercase word that is not followed by a digit (so in foo123, it matches fo but not foo).
(?<=FOO)BAR (lookbehind) matches BAR, but only if it is immediately preceded by a match for FOO. FOO must have a known length (you can't use repetition operators such as *). This is most useful to anchor a match without including the preceding text in the match: (?<=^| )foo matches foo but only if it's preceded by a space or by the beginning of the string.
(?<!FOO)BAR (negative lookbehind) matches BAR, but only if it is not immediately preceded by a match for FOO. FOO must have a known length (you can't use repetition operators such as *). This is most useful to anchor a match without including the preceding text in the match: (?<![a-z])foo matches foo but only if it is not preceded by a lowercase letter.
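
The constructs above can be tried out with Python's re module, which is assumed here as a stand-in: it is not the PCRE library, but it supports these particular group and lookaround forms. One difference worth noting is that Python insists on a fixed-width lookbehind, so the sketch uses (?<= )foo rather than the (?<=^| )foo example.

```python
import re

# (?:...) groups without capturing, so only (c) is numbered.
print(re.match(r"(?:ab)+(c)", "ababc").groups())        # ('c',)

# Lookahead: foo only when followed by bar; bar is not part of the match.
print(re.search(r"foo(?=bar)", "foobar").group())       # foo
print(re.search(r"foo(?=bar)", "foobaz"))               # None

# Negative lookahead: letters not followed by a digit.
print(re.search(r"[a-z]+(?![0-9])", "foo123").group())  # fo

# Negative lookahead at the start of a whole-string match.
print(re.fullmatch(r"(?!foo)[a-z]+", "foobar"))         # None
print(re.fullmatch(r"(?!foo)[a-z]+", "barfoo").group()) # barfoo

# Lookbehind: foo only when preceded by a space.
print(re.search(r"(?<= )foo", "a foo").group())         # foo

# Negative lookbehind: foo not preceded by a lowercase letter.
print(re.search(r"(?<![a-z])foo", "xfoo"))              # None
print(re.search(r"(?<![a-z])foo", "a foo").group())     # foo
```

In every case the lookaround text is checked but left out of the reported match.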

Emacs

Emacs's syntax is intermediate between BRE and ERE. In addition to Emacs, it is the default syntax for -regex in GNU find. Emacs offers the following operators:

^, $, ., […], *, +, ? as in ERE
\(…\), \|, \{…\}, \DIGIT as in BRE
more backslash-letter sequences; \< and \> for word boundaries; and further additions in recent versions of Emacs, which are often not supported in other engines with an Emacs-like syntax.

Shell globs

Shell globs (wildcards) perform pattern matching with a syntax that is completely different from regular expressions and less powerful. In addition to shells, these wildcards are available with other tools such as find -name and rsync filters. POSIX patterns include the following features:

? matches any single character.
[…] is a character set as in common regular expression syntaxes. Some shells do not support character classes. Some shells require ! instead of ^ to negate the set.
* matches any sequence of characters (often except / when matching file paths; if / is excluded from *, then ** sometimes includes /, but check the tool's documentation).
Backslash quotes the next character.

Ksh offers additional features which give its pattern matching the full power of regular expressions. These features are also available in bash after running shopt -s extglob. Zsh has a different syntax but can also support ksh's syntax after setopt ksh_glob.
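
To get a feel for the basic glob operators listed above without involving a shell, one option is Python's fnmatch module. This is an assumption chosen for illustration: fnmatch approximates shell-style wildcards but is not a shell, and in particular its * does not treat / specially.

```python
from fnmatch import fnmatchcase

# ? matches exactly one character.
print(fnmatchcase("file1.txt", "file?.txt"))    # True
print(fnmatchcase("file10.txt", "file?.txt"))   # False

# [!...] negates a character set (fnmatch uses ! for negation).
print(fnmatchcase("d.txt", "[!a-c].txt"))       # True
print(fnmatchcase("a.txt", "[!a-c].txt"))       # False

# * matches any sequence of characters, here including the slash.
print(fnmatchcase("dir/notes.txt", "*.txt"))    # True
```

Tools that match file paths, such as find -name or a shell expanding *.txt, may treat the slash differently, which is why the last line is worth checking against the tool's documentation.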
