Need to escape regex characters in sed to be interpreted as regex characters

quoting, regular expression, sed

It seems, e.g. in
cat sed_data.txt | sed 's/\b[0-9]\{3\}\b/NUMBER/g'
that I must escape certain characters for them to act as regular expression operators. In this case I had to escape the braces for them to be interpreted as a repetition count.
Why? I was expecting that everything would be a regex character unless escaped, i.e. the opposite.

Best Answer

This is because sed uses POSIX BREs (Basic Regular Expressions) by default, as opposed to the ERE-style syntax (Extended Regular Expressions) you're probably used to from Perl or friends.

From the sed(1) man page:

REGULAR EXPRESSIONS
       POSIX.2 BREs should be supported, but they aren't completely because of
       performance problems.  The \n sequence in a regular expression  matches
       the newline character, and similarly for \a, \t, and other sequences.

Relevant quote from the above link:

The Basic Regular Expressions or BRE flavor standardizes a flavor similar to the one used by the traditional UNIX grep command. This is pretty much the oldest regular expression flavor still in use today. One thing that sets this flavor apart is that most metacharacters require a backslash to give the metacharacter its flavor. Most other flavors, including POSIX ERE, use a backslash to suppress the meaning of metacharacters.

Quoted verbatim from Craig Sanders' comment:

Note that in GNU sed at least, you can tell sed to use extended regexps with the -r or --regexp-extended command line option. This is useful if you want to avoid uglifying your sed script with excessive escaping.
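
As a minimal sketch of the difference, the following runs the question's substitution once in the default BRE mode and once in ERE mode. The sample input string is made up for illustration, and \b is a GNU extension, so this assumes GNU sed. The -E option is used here; GNU sed treats -r as a synonym.

```shell
# Hypothetical sample input: one three-digit number and one four-digit number.
input='abc 123 def 4567'

# BRE (sed's default): braces need a backslash to act as an interval.
bre=$(printf '%s\n' "$input" | sed 's/\b[0-9]\{3\}\b/NUMBER/g')

# ERE (-E, or -r in GNU sed): braces are special without a backslash.
ere=$(printf '%s\n' "$input" | sed -E 's/\b[0-9]{3}\b/NUMBER/g')

# In BRE an unescaped brace is an ordinary character, so this replaces
# the literal text "{3}".
lit=$(printf '%s\n' 'a{3}' | sed 's/{3}/X/')

printf '%s\n' "$bre" "$ere" "$lit"
```

Both substitutions should give "abc NUMBER def 4567", and the last command should turn "a{3}" into "aX".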

Related Solutions

Why do some regex commands have opposite interpretations of ‘\’ with various characters

The answer is really "just because". There's a whole bunch of different regular expression syntaxes, and while they share a similar appearance and usually the basics are the same, they vary in the particulars.

Historically, every tool had its own new implementation, doing whatever the author thought best. There's a balance between making characters special with and without escaping — too many characters that are "naturally special" and you end up having to escape them all the time just to match on them; or, the other way around, you end up needing a bunch of escapes to use common regex syntax like () grouping. And everyone writing a program decided how to do it based on the needs of what their program matched against, on what they felt was the right approach, and on the phase of the moon.

There's an attempt at standardization from POSIX, which defines "basic regular expressions" and "extended regular expressions". Awesomely, these work backwards from each other with regard to \: sometimes, but not with perfect consistency.
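
A small sketch of that "backwards" behaviour, using grep with and without -E on two made-up input lines. The variable names are only for illustration.

```shell
# Two hypothetical lines: one containing a literal "a{2}", one containing "aa".
data=$(printf '%s\n' 'a{2}' 'aa')

# BRE: \{2\} is an interval, so this selects the line "aa".
bre_interval=$(printf '%s\n' "$data" | grep 'a\{2\}')

# BRE: unescaped braces are literal, so this selects the line "a{2}".
bre_literal=$(printf '%s\n' "$data" | grep 'a{2}')

# ERE: the roles of the escaped and unescaped forms are swapped.
ere_interval=$(printf '%s\n' "$data" | grep -E 'a{2}')
ere_literal=$(printf '%s\n' "$data" | grep -E 'a\{2\}')

printf '%s\n' "$bre_interval" "$bre_literal" "$ere_interval" "$ere_literal"
```

The inconsistency shows up with characters such as * and ., which are special without a backslash in both flavors.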

Perl regular expressions have become another de facto standard, for two reasons: first, they're very flexible and powerful, and second, they're actually pretty sane, with conventions like "\ always escapes a non-alphanumeric character".

GNU Find has a -regextype option, where you can change the regular expression syntax used. Sadly, "perl" is not an option, at least in the version of find I have. (The default is, not surprisingly from GNU, "emacs", and that syntax is documented here.)
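
For illustration, here is a rough sketch of -regextype in use. It assumes GNU find and picks the posix-extended type; the scratch directory and file names are hypothetical.

```shell
# Scratch directory with two hypothetical file names.
dir=$(mktemp -d)
touch "$dir/123.txt" "$dir/abcd.txt"

# -regex is matched against the whole path, hence the leading ".*/".
# With posix-extended, {3} is an interval without backslashes.
# -regextype has to appear before the -regex test it should affect.
found=$(find "$dir" -regextype posix-extended -regex '.*/[0-9]{3}\.txt')

printf '%s\n' "$found"
rm -r "$dir"
```

Only the path ending in 123.txt should be printed.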

Shell – grep and escaping a dollar sign

There are two separate issues here.

1. grep uses Basic Regular Expressions (BRE), and $ is a special character in BREs only at the end of an expression. The consequence of this is that the two instances of $ in $Id$ are not equal. The first one is a normal character and the second is an anchor that matches the end of the line. To make the second $ match a literal $ you'll have to backslash escape it, i.e. $Id\$. Escaping the first $ also works: \$Id\$, and I prefer this since it looks more consistent.¹
2. There are two completely unrelated escaping/quoting mechanisms at work here: shell quoting and regex backslash quoting. The problem is that many characters that regular expressions use are special to the shell as well, and on top of that the regex escape character, the backslash, is also a shell quoting character. This is why you often see messes involving double backslashes, but I do not recommend using backslashes for shell quoting regular expressions because it is not very readable.

Instead, the simplest way to do this is to first put your entire regex inside single quotes as in 'regex'. The single quote is the strongest form of quoting the shell has, so as long as your regex does not contain single quotes, you no longer have to worry about shell quoting and can focus on pure BRE syntax.

So, applying this back to your original example, let's throw the correct regex (\$Id\$) inside single quotes. The following should do what you want:

grep '\$Id\$' my_dir/my_file

The reason an unquoted \$Id\$ does not work is that after shell quote removal (the more precise name for what the shell does with quoting) is applied, the regex that grep sees is $Id$. As explained in (1.), this regex matches a literal $Id only at the end of a line because the first $ is literal while the second is a special anchor character.
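
The effect of quote removal can be made visible by printing the argument the shell would hand to grep. This is only a sketch: the sample line and variable names are invented for illustration.

```shell
# What grep receives as its pattern argument in each case.
unquoted=$(printf '%s' \$Id\$)      # the shell strips the backslashes: $Id$
quoted=$(printf '%s' '\$Id\$')      # single quotes keep them: \$Id\$

# A hypothetical line with the keyword in the middle, not at the end.
line='version: $Id$ (draft)'

# Count matching lines. "|| true" keeps grep's no-match exit status
# from aborting a script that runs under "set -e".
n_unquoted=$(printf '%s\n' "$line" | grep -c "$unquoted" || true)
n_quoted=$(printf '%s\n' "$line" | grep -c "$quoted" || true)

printf '%s -> %s match(es)\n' "$unquoted" "$n_unquoted" "$quoted" "$n_quoted"
```

With the backslashes stripped, the pattern matches zero lines here because "$Id" is not at the end of the line. With single quotes it matches one.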

¹ Note also that if you ever switch to Extended Regular Expressions (ERE), e.g. if you decide to use egrep (or grep -E), the $ character is always special. In EREs, $Id$ would never match anything because you can't have characters after the end of a line, so \$Id\$ would be the only way to go.
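
A quick check of the ERE behaviour described in the footnote, sketched with grep -E on invented sample lines:

```shell
# Hypothetical sample line with the keyword in the middle.
line='version: $Id$ (draft)'

# ERE: every unescaped $ is an anchor, so '$Id$' cannot match any line,
# not even a line that consists of exactly $Id$.
ere_plain=$(printf '%s\n' "$line" | grep -cE '$Id$' || true)
ere_exact=$(printf '%s\n' '$Id$' | grep -cE '$Id$' || true)

# Escaping both dollar signs matches the literal text.
ere_escaped=$(printf '%s\n' "$line" | grep -cE '\$Id\$' || true)

printf '%s\n' "$ere_plain" "$ere_exact" "$ere_escaped"
```

The first two counts should be 0 and the last should be 1.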
