Bash – How does storing the regular expression in a shell variable avoid problems with quoting characters that are special to the shell?

bash, regular expression

Storing the regular expression in a shell variable is often a useful way to avoid problems with quoting characters that are special to the
shell. It is sometimes difficult to specify a regular expression
literally without using quotes, or to keep track of the quoting used
by regular expressions while paying attention to the shell’s quote
removal. Using a shell variable to store the pattern decreases these
problems. For example, the following are equivalent:
pattern='[[:space:]]*(a)?b'
[[ $line =~ $pattern ]]
and
[[ $line =~ [[:space:]]*(a)?b ]]
If you want to match a character that’s special to the regular
expression grammar, it has to be quoted to remove its special meaning.
This means that in the pattern xxx.txt, the . matches any
character in the string (its usual regular expression meaning), but in
the pattern "xxx.txt" it can only match a literal .. Shell
programmers should take special care with backslashes, since
backslashes are used both by the shell and regular expressions to
remove the special meaning from the following character. The following
two sets of commands are not equivalent:
pattern='\.'

[[ . =~ $pattern ]]
[[ . =~ \. ]]

[[ . =~ "$pattern" ]]
[[ . =~ '\.' ]]
The first two matches will succeed, but the second two will not,
because in the second two the backslash will be part of the pattern to
be matched. In the first two examples, the backslash removes the
special meaning from ., so the literal . matches. If the string in
the first examples were anything other than ., say a, the pattern
would not match, because the quoted . in the pattern loses its
special meaning of matching any single character.
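
The quoted passage's claims about backslashes can be checked with a short sketch. It assumes bash 3.2 or later without bash 3.1 compatibility enabled; the variable names r1 to r4 are only for illustration and are not part of the manual text.

```bash
#!/usr/bin/env bash
# Sketch only: assumes bash >= 3.2 with compat31 not enabled.
pattern='\.'

# Unquoted expansion: the value is used as a regular expression,
# so \. means "a literal dot".
[[ . =~ $pattern ]] && r1=match || r1=nomatch

# Quoted expansion: the value is matched as a literal string,
# so the subject must contain a backslash followed by a dot.
[[ . =~ "$pattern" ]] && r2=match || r2=nomatch

# The escaped dot no longer matches an arbitrary character.
[[ a =~ $pattern ]] && r3=match || r3=nomatch

# A subject that really contains backslash-dot matches the quoted form.
[[ '\.' =~ "$pattern" ]] && r4=match || r4=nomatch

echo "$r1 $r2 $r3 $r4"
```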

How is storing the regular expression in a shell variable a useful way to avoid problems with quoting characters that are special to the shell?

The given examples don't seem to explain that.
In the given examples, the regex literals in one method and the values of the shell variable pattern in the other method are the same.

Thanks.

Best Answer

The tokenisation of [[ ... ]] clashes with regular expression syntax (more on that in my answer to your follow-up question), and \ is overloaded as both a shell quoting operator and a regexp operator (with some interference between the two in bash). Even when there's no apparent reason for a clash, the behaviour can be surprising, and the rules can be confusing.

Who can tell what these will do without trying them (on all possible input) with any given version of bash?

[[ $a = a|b ]]
[[ $a =~ a|b ]]
[[ $a =~ a&b ]]
[[ $a =~ (a|b) ]]
[[ $a =~ ([)}]*) ]]
[[ $a =~ [/\(] ]]
[[ $a =~ \s+ ]]
[[ $a =~ ( ) ]]
[[ $a =~ [ ] ]]
[[ $a =~ ([ ]) ]]

You can't simply quote the regexps either: since bash 3.2, unless bash 3.1 compatibility has been enabled, quoting a regexp removes the special meaning of its RE operators. For instance,

[[ $a =~ 'a|b' ]]

matches only if $a contains a literal a|b.
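
A minimal sketch of that behaviour, assuming bash 3.2 or later without bash 3.1 compatibility enabled (the sample values of $a and the result variable names are made up for illustration):

```bash
#!/usr/bin/env bash
# Sketch only: assumes bash >= 3.2 with compat31 not enabled.

# With the regexp quoted, | is not an alternation operator any more,
# so a plain "a" does not match.
a='a'
if [[ $a =~ 'a|b' ]]; then quoted_plain=match; else quoted_plain=nomatch; fi

# A string that contains the three characters a|b does match.
a='xa|by'
if [[ $a =~ 'a|b' ]]; then quoted_literal=match; else quoted_literal=nomatch; fi

echo "$quoted_plain $quoted_literal"
```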

Storing the regexp in a variable avoids all those problems and also makes the code compatible with ksh93 and zsh (provided you limit yourself to POSIX EREs):

regexp='a|b'
[[ $a =~ $regexp ]] # $regexp should *not* be quoted.

There's no ambiguity in the parsing/tokenising of that shell command, and the regexp that is used is the one stored in the variable without any transformation.
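
As a rough illustration of the variable approach, the sketch below runs a few sample strings against a pattern held in a variable. The anchored pattern and the sample inputs are assumptions chosen for the example, and the script has only been written with bash in mind.

```bash
#!/usr/bin/env bash
# Hypothetical pattern: the whole string must be exactly "a" or "b".
regexp='^(a|b)$'

results=()
for a in a b c 'a|b'; do
  # $regexp is left unquoted so that its value is used as a regexp.
  if [[ $a =~ $regexp ]]; then
    results+=("$a:match")
  else
    results+=("$a:nomatch")
  fi
done

printf '%s\n' "${results[@]}"
```

Because the single-quoted assignment stores the pattern verbatim, the only quoting to think about is the ordinary shell quoting of that one assignment.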
