Shell – History of Bash globbing

history, regular expression, shell, wildcards

Is there a historical reason why Bash "globbing" and regular expressions are not identical? For example, I believe that in Bash [1-2]* matches anything that starts with a 1 or a 2 followed by anything else, while as a regular expression [1-2]* would match only a sequence of 1s and 2s. My Bash scripting and regex-fu are both pretty weak, and I regularly run into problems caused by these differences, which made me curious as to why they are different.

Best Answer

bash was initially designed in the late 80s as a partial clone of ksh with some interactive features from csh/tcsh.

The origins of globbing have to be found in those earlier shells which it builds upon.

ksh itself is an extension of the Bourne shell. The Bourne shell itself (first released in 1979 in Unix V7) was a clean implementation from scratch, but it did not depart completely from the Thompson shell (the shell of V1 -> V6) and incorporated features from the Mashey shell.

In particular, command arguments were still separated by blanks, | was now the new pipe operator but ^ was still supported as an alternative (which also explains why you write [!a-z] and not [^a-z]), $1 was still the first argument to a script, and backslash was still the escape character. So many of the regexp operators (^\|$) have a special meaning of their own in the shell.

The Thompson shell relied on an external utility for globbing. When sh found unquoted *, [ or ?s in the command, it would run the command through glob.

rm *.txt

would end up running glob as:

["glob", "rm", "*.txt"]

and glob would end up running rm with the list of files matching that pattern.

grep a.\*b *.txt

would run glob as:

["glob", "grep", "a.\252b", "*.txt"]

The * above has been quoted by setting the 8th bit on that character, preventing glob from treating it as a wildcard. glob would then remove that bit before calling grep.
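The \252 in that argument list is plain octal arithmetic: * is octal 052 in ASCII, and setting the 8th bit (octal 0200) gives 0252. A minimal sketch of that bit manipulation follows; the variable names are only illustrative and nothing here is taken from the original glob source.

```bash
# '*' is octal 052 in ASCII; the 8th bit is octal 0200.
quoted=$(( 052 | 0200 ))        # mark the character as quoted
restored=$(( quoted & 0177 ))   # strip the mark again, as glob did before exec

printf '%o\n' "$quoted"         # prints 252
printf '%o\n' "$restored"       # prints 52
```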

To do the equivalent with regexps, that would have been:

regexp rm '\.txt$'

Or:

regexp rm '^[^.].*\.txt$'

to exclude dot-files.

The need to escape operators that double as shell special characters, and the fact that ., which is common in filenames, is a regexp operator, make regexps ill-suited to matching filenames and complicated for a beginner. In most cases, all you need is wildcards that can replace either one (?) or any number (*) of characters.
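To see the difference from the question concretely, here is a small sketch that tests the same text, [1-2]*, once as a shell pattern and once as an extended regular expression anchored to the whole string. The helper names glob_match and regex_match are made up for this illustration.

```bash
# Match a string against a shell pattern (the pattern must cover the whole string).
glob_match() {
    case $1 in
        ($2) return 0 ;;
        (*)  return 1 ;;
    esac
}

# Match a string against an ERE, anchored at both ends for a fair comparison.
regex_match() {
    printf '%s\n' "$1" | grep -Eq -- "^($2)\$"
}

for name in 1abc 1221 3abc; do
    if glob_match "$name" '[1-2]*'; then g=yes; else g=no; fi
    if regex_match "$name" '[1-2]*'; then r=yes; else r=no; fi
    printf '%-5s glob=%-3s regex=%s\n' "$name" "$g" "$r"
done
```

As a glob, the pattern accepts 1abc and 1221 because * stands for "any characters"; as a regexp it accepts only 1221 because * means "repeat the preceding bracket expression".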

Now, different shells added different globbing operators. Nowadays, the ksh and zsh globs (and to some extent bash -O extglob, which implements a subset of ksh globs) are functionally equivalent to regexps, with a syntax that is less cumbersome to use with filenames and the current shell syntax. For instance, in zsh (with the extendedglob option), you can do:

echo a#.txt

if you want (unlikely) to match filenames that consist of sequences of a followed by .txt. That is easier than echo (^a*\.txt$) (here using parentheses as a way to isolate the regex operators from the shell operators, which could have been one way for shells to deal with it).

echo (foo|bar|<1-20>).(#i)mpg

For mpg files (case insensitive) whose basename is foo, bar or a decimal number from 1 to 20...
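For comparison, a rough bash counterpart using the ksh-style extended operators might look like the sketch below. It assumes bash 4.1 or later, where the right-hand side of == inside [[ ]] is matched as if extglob were enabled (for filename expansion you would still need shopt -s extglob). The helper name matches is hypothetical, and only the alternation and repetition are reproduced, not the case-insensitive flag or the numeric range.

```bash
# Succeeds if string $1 matches the (extended) shell pattern in $2.
# The right-hand side is left unquoted so that it is treated as a pattern.
matches() {
    [[ $1 == $2 ]]
}

for name in aaa.txt .txt ab.txt; do
    # *(a) is the ksh/bash spelling of "zero or more a", like a# in zsh.
    if matches "$name" '*(a).txt'; then
        printf '%s matches *(a).txt\n' "$name"
    else
        printf '%s does not match *(a).txt\n' "$name"
    fi
done

# @(foo|bar) matches exactly one of the alternatives.
if matches bar.mpg '@(foo|bar).mpg'; then
    printf 'bar.mpg matches @(foo|bar).mpg\n'
fi
```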

ksh93 now can also incorporate regexps (basic, extended, perl-like or "augmented") in its globs (though it's quite buggy) and even provides a tool to convert between glob and regexp (printf %R, printf %P):

echo ~(Ei:.*\.txt)

to match (non-hidden) txt files with Extended regular expressions, case-insensitively.

Bash – How to search bash’s history in vi mode for “foo.*bar”

POSIX specification

The POSIX specification for shell Command Line Editing (vi-mode) states that history search patterns should use regular shell pattern matching. Although the ^ meta-character is used to match the start of a line, the patterns are not regular expressions.

/pattern<newline>

Move backwards through the command history, searching for the specified pattern, beginning with the previous command line. Patterns use the pattern matching notation described in Pattern Matching Notation, except that the '^' character shall have special meaning when it appears as the first character of pattern. In this case, the '^' is discarded and the characters after the '^' shall be matched only at the beginning of a line. Commands in the command history shall be treated as strings, not as filenames.
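The matching rule quoted above can be approximated in a few lines of shell. This is only a sketch of the semantics, not how any shell implements its search: the function name history_line_matches is invented, and it assumes that a pattern without a leading ^ may match anywhere in the line.

```bash
# Succeeds if history line $1 matches search pattern $2 under the rule above.
history_line_matches() {
    local line=$1 pattern=$2
    case $pattern in
        ('^'*) pattern=${pattern#'^'}'*' ;;   # leading ^: anchor at start of line
        (*)    pattern='*'$pattern'*'    ;;   # assumed: otherwise match anywhere
    esac
    case $line in
        ($pattern) return 0 ;;
        (*)        return 1 ;;
    esac
}

# The regexp foo.*bar corresponds to the shell pattern foo*bar.
if history_line_matches 'grep foo file | bar' 'foo*bar'; then
    echo 'shell pattern foo*bar matches'
fi
# Read as a shell pattern, foo.*bar needs a literal dot after foo.
if ! history_line_matches 'grep foo file | bar' 'foo.*bar'; then
    echo 'shell pattern foo.*bar does not match'
fi
```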

Documented Bash implementation

Bash uses the GNU Readline library to provide its interactive line-editing and history searching capabilities. The official documentation for the Readline library focuses more on Emacs mode, but a short section in its manual, Readline vi Mode, states:

While the Readline library does not have a full set of vi editing functions, it does contain enough to allow simple editing of the line.

The Readline vi mode behaves as specified in the POSIX standard.

Actual Bash implementation

After a number of experiments on two different systems, I found that the non-incremental searching in Bash/Readline does not work as described in its official documentation. I found that the * was treated as a literal asterisk rather than a pattern that matches multiple characters. Likewise, the ? and [ are also treated as literal characters.

For comparison, I tried using Vi-mode in tcsh and verified that it correctly implements history searching as specified in the POSIX standard.

I then downloaded and searched through the code for the Readline library and found its history searching functions use a simple substring search and don’t use any search pattern meta-characters – aside from the caret, ^ (see search.c from the git repository for the Readline library).

I presume the Bash/Readline developers have yet to implement this feature. I couldn’t find a bug list, but the CHANGES file shows that they’ve been regularly fixing issues relating to Vi-mode.

Update: This feature was implemented in Readline 8.0 (released with Bash 5.0 in January 2019). As documented in its CHANGES:

New Features in Readline

a. Non-incremental vi-mode search (N, n) can search for a shell pattern, as Posix specifies (uses fnmatch(3) if available).

Bash – How does storing the regular expression in a shell variable avoid problems with quoting characters that are special to the shell

[[ ... ]] tokenisation clashes with regular expressions (more on that in my answer to your follow-up question) and \ is overloaded as a shell quoting operator and a regexp operator (with some interference between the two in bash), and even when there's no apparent reason for a clash, the behaviour can be surprising. Rules can be confusing.

Who can tell what these will do without trying it (on all possible input) with any given version of bash?

[[ $a = a|b ]]
[[ $a =~ a|b ]]
[[ $a =~ a&b ]]
[[ $a =~ (a|b) ]]
[[ $a =~ ([)}]*) ]]
[[ $a =~ [/\(] ]]
[[ $a =~ \s+ ]]
[[ $a =~ ( ) ]]
[[ $a =~ [ ] ]]
[[ $a =~ ([ ]) ]]

You can't quote the regexps either: since bash 3.2, and unless bash 3.1 compatibility has been enabled, quoting a regexp removes the special meaning of the RE operators. For instance,

[[ $a =~ 'a|b' ]]

matches only if $a contains a literal a|b.

Storing the regexp in a variable avoids all those problems and also makes the code compatible with ksh93 and zsh (provided you limit yourself to POSIX EREs):

regexp='a|b'
[[ $a =~ $regexp ]] # $regexp should *not* be quoted.

There's no ambiguity in the parsing/tokenising of that shell command, and the regexp that is used is the one stored in the variable without any transformation.
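The contrast between the quoted and unquoted forms can be checked with a short sketch. The function names re_match and literal_match are hypothetical, and the behaviour shown assumes bash 3.2 or later without the bash 3.1 compatibility setting.

```bash
# Unquoted expansion: the variable's content is used as an ERE.
re_match() {
    local string=$1 regexp=$2
    [[ $string =~ $regexp ]]
}

# Quoted expansion: the variable's content is matched as a literal substring.
literal_match() {
    local string=$1 regexp=$2
    [[ $string =~ "$regexp" ]]
}

regexp='a|b'
for a in b 'xa|bx'; do
    if re_match "$a" "$regexp"; then r=yes; else r=no; fi
    if literal_match "$a" "$regexp"; then l=yes; else l=no; fi
    printf '%-6s as-regexp=%-3s as-literal=%s\n' "$a" "$r" "$l"
done
```

Passing the regexp as a function argument or variable also means spaces and backslashes in it need no extra escaping for the shell tokeniser, for example re_match 'a  b' '^a[[:space:]]+b$'.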
