Why do some regex commands have opposite interpretations of ‘\’ with various characters?

find, regular expression

Take, for example, this command:

find . -regex ".*\.\(cpp\|h\)"

This will find all the .h and .cpp files in your directory. The period character '.' in regular expressions usually means "any character". To get it to match only an actual period, you must escape it using the backslash character '\'.

In this case, given a character with a special meaning, you must escape it to get the actual character it represents.

Now, take the parentheses and the "or" bar: the characters '(', ')', and '|'. These also have special meanings, used for grouping and alternation in regular expressions. However, to get the special meaning, these characters must be escaped using the backslash! Without the backslash, each one simply matches itself.
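A quick way to see the difference, as a rough sketch that assumes GNU find with its default regex syntax (the scratch directory and file names are made up for illustration):

```shell
# Scratch directory with two hypothetical file names.
dir=$(mktemp -d)
touch "$dir/report(1).txt" "$dir/report1.txt"

# Bare ( and ) are ordinary characters here, so only the name
# that really contains "(1)" matches.
literal_out=$(cd "$dir" && find . -regex '.*(1)\.txt')

# Escaped \( and \) form a group around "1"; the parentheses themselves
# match nothing, so this finds report1.txt instead.
group_out=$(cd "$dir" && find . -regex '.*t\(1\)\.txt')

printf 'bare parens:    %s\n' "$literal_out"
printf 'escaped parens: %s\n' "$group_out"
rm -rf "$dir"
```

Note that -regex has to match the whole path, which is why both patterns start with '.*'.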

Why is the '.' treated differently from '(', ')', and '|'?

Best Answer

The answer is really "just because". There's a whole bunch of different regular expression syntaxes, and while they share a similar appearance and usually the basics are the same, they vary in the particulars.

Historically, every tool had its own new implementation, doing whatever the author thought best. There's a balance between making characters special with and without escaping — too many characters that are "naturally special" and you end up having to escape them all the time just to match on them; or, the other way around, you end up needing a bunch of escapes to use common regex syntax like () grouping. And everyone writing a program decided how to do it based on the needs of what their program matched against, on what they felt was the right approach, and on the phase of the moon.

There's an attempt at standardization from POSIX, which defines "basic regular expressions" and "extended regular expressions". Awesomely, these work backwards from each other with regard to \ for some characters, but not with perfect consistency.
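To make the basic/extended split concrete, here is a small sketch using grep, which uses basic syntax by default and extended syntax with -E. The sample names are invented, and the counts are just a convenient way to compare the two modes:

```shell
# Sample names, one per line (made up for illustration).
names='main.cpp
util.h
notes.txt
a(b)'

# Basic syntax: \( \) group, while bare ( ) match themselves.
bre_group=$(printf '%s\n' "$names" | grep -c '\.\(cpp\)$')
bre_literal=$(printf '%s\n' "$names" | grep -c '(b)')

# Extended syntax: ( ) and | are operators, while \( \) match themselves.
ere_group=$(printf '%s\n' "$names" | grep -Ec '\.(cpp|h)$')
ere_literal=$(printf '%s\n' "$names" | grep -Ec '\(b\)')

echo "basic:    group=$bre_group literal=$bre_literal"
echo "extended: group=$ere_group literal=$ere_literal"
```

In both modes '\.' means a literal period, so the escaping of '.' stays the same while the escaping of the parentheses flips.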

Perl regular expressions have become another de facto standard, for two reasons: first, they're very flexible and powerful, and second, they're actually pretty sane, with conventions like "\ always escapes a non-alphanumeric character".

GNU Find has a -regextype option, where you can change the regular expression syntax used. Sadly, "perl" is not an option, at least in the version of find I have. (The default is, not surprisingly from GNU, "emacs", and that syntax is documented here.)
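As a sketch of what that looks like in practice, assuming a GNU find that accepts the posix-extended type (the scratch directory and file names are made up), the command from the question can be written with and without the backslashes:

```shell
# Scratch directory with a few hypothetical file names.
dir=$(mktemp -d)
touch "$dir/main.cpp" "$dir/util.h" "$dir/notes.txt"

# Default (emacs) syntax: grouping and alternation need backslashes.
emacs_out=$(cd "$dir" && find . -regex '.*\.\(cpp\|h\)' | sort)

# posix-extended syntax: the same operators are written bare.
# -regextype has to come before the -regex it should affect.
ere_out=$(cd "$dir" && find . -regextype posix-extended -regex '.*\.(cpp|h)' | sort)

printf '%s\n' "$ere_out"
rm -rf "$dir"
```

Both commands should list the same two files; only the escaping of the parentheses and the bar changes.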