Shell – History of Bash globbing

Tags: history, regular expression, shell, wildcards

Is there a historical reason why Bash "globbing" and regular expressions are not identical? For example, I believe that in Bash [1-2]* matches anything that starts with a 1 or a 2 followed by anything else, while as a regular expression [1-2]* would match only a sequence of 1s and 2s. My Bash scripting and regex-fu are both pretty weak, and I regularly run into problems associated with these differences, which made me curious as to why they are different.

Best Answer

bash was initially designed in the late 80s as a partial clone of ksh with some interactive features from csh/tcsh.

The origins of globbing are to be found in the earlier shells that bash builds upon.

ksh itself is an extension of the Bourne shell. The Bourne shell (first released in 1979 in Unix V7) was a clean implementation from scratch, but it did not depart completely from the Thompson shell (the shell of Unix V1 through V6) and incorporated features from the Mashey shell.

In particular, command arguments were still separated by blanks; | was now the new pipe operator, but ^ was still supported as an alternative (which also explains why you write [!a-z] and not [^a-z]); $1 was still the first argument to a script; and backslash was still the escape character. So many of the regexp operators (^, \, | and $) already have a special meaning of their own in the shell.
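As a small illustration of the bracket negation mentioned above, here is a minimal sketch. The function name is_not_lowercase is made up for this example. It uses ! for negation, which is the form POSIX specifies for shell patterns (bash happens to accept ^ as well).

```bash
# Hypothetical helper: succeeds when $1 is a single character outside a-z.
is_not_lowercase() {
  case $1 in
    [!a-z]) return 0 ;;   # "!" negates the bracket expression in a glob
    *)      return 1 ;;
  esac
}

is_not_lowercase 7 && echo "7: not a lowercase letter"
is_not_lowercase q || echo "q: lowercase letter"
```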

The Thompson shell relied on an external utility for globbing. When sh found an unquoted *, [ or ? in a command line, it would run the command through glob. For example:

rm *.txt

would end up running glob as:

["glob", "rm", "*.txt"]

and glob would end up running rm with the list of files matching that pattern.
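To make that division of labour concrete, here is a rough bash sketch of what such an external helper does. It is not the original glob source: the name glob_run is invented, and unlike the real utility it leaves a pattern untouched when nothing matches instead of reporting an error.

```bash
# Hypothetical stand-in for the old glob utility: expand each argument
# as a pattern, then run the command with the resulting file names.
glob_run() {
  local cmd=$1 arg
  local -a expanded=()
  shift
  local IFS=              # empty IFS: unquoted $arg is globbed, not word-split
  for arg in "$@"; do
    expanded+=( $arg )    # pathname expansion happens here
  done
  "$cmd" "${expanded[@]}"
}

# Usage, with the pattern quoted so the calling shell leaves it alone:
#   glob_run ls '*.txt'
```

The caller passes the pattern unexpanded, and only the helper decides which files it stands for, which is the arrangement the Thompson shell used.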

grep a.\*b *.txt

would run glob as:

["glob", "grep", "a.\252b", "*.txt"]

The * above has been quoted by setting the 8th bit on that character (octal 052 becomes 0252), which prevents glob from treating it as a wildcard. glob would then clear that bit before calling grep.
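The arithmetic behind that \252 can be checked with a couple of lines of bash. This is only a sketch of the bit manipulation, not of how the Thompson shell was written, and the function names are invented for the example.

```bash
# Hypothetical helper: print, in octal, the byte that results from setting
# the 8th bit on a character (the "quoted" form described above).
quoted_octal() {
  local code
  code=$(printf '%d' "'$1")            # numeric value of the character
  printf '%o\n' "$(( code | 0200 ))"   # 0200 (octal) is the 8th bit
}

# Hypothetical helper: clear the 8th bit again, as glob did before exec.
unquoted_octal() {
  printf '%o\n' "$(( 0$1 & 0177 ))"    # $1 is an octal number such as 252
}

quoted_octal '*'     # prints 252
unquoted_octal 252   # prints 52, the octal code of *
```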

Doing the equivalent of rm *.txt with regexps would have required something like:

regexp rm '\.txt$'

Or:

regexp rm '^[^.].*\.txt$'

to exclude dot-files.
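No such regexp command exists; it is only a thought experiment. For the curious, a rough bash approximation could look like the sketch below, where regexp_run is an invented name and the matching is done with bash's [[ =~ ]] operator (extended regular expressions).

```bash
# Hypothetical counterpart of the imaginary "regexp" command: run a command
# on the entries of the current directory whose names match a regexp.
regexp_run() {
  local cmd=$1 re=$2 name
  local -a matched=()
  for name in * .*; do
    [[ $name == . || $name == .. ]] && continue
    [[ -e $name || -L $name ]] || continue   # pattern that matched nothing
    [[ $name =~ $re ]] && matched+=( "$name" )
  done
  "$cmd" "${matched[@]}"
}

# Usage:
#   regexp_run ls '\.txt$'            # includes hidden files
#   regexp_run ls '^[^.].*\.txt$'     # excludes them
```

Note how the caller has to spell out both the anchoring and the dot-file exclusion that a plain *.txt gives for free.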

The need to escape operators that double as shell special characters, and the fact that ., which is common in filenames, is a regexp operator, make regexps poorly suited to matching filenames and complicated for a beginner. In most cases, all you need are wildcards that stand for either one character (?) or any number of characters (*).
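To tie this back to the [1-2]* example from the question, here is a short bash sketch that applies the same text once as a glob and once as an anchored regexp. The helper names glob_match and regex_match are invented for the example.

```bash
glob_match()  { [[ $1 == $2 ]]; }   # $2 unquoted: treated as a glob pattern
regex_match() { [[ $1 =~ $2 ]]; }   # $2 unquoted: treated as an extended regexp

for s in 1abc 2 1212 abc; do
  glob_match  "$s" '[1-2]*'   && g=match || g="no match"
  regex_match "$s" '^[1-2]*$' && r=match || r="no match"
  printf '%s: glob %s, regexp %s\n' "$s" "$g" "$r"
done
```

The string 1abc satisfies the glob (a 1 or 2, then anything) but not the regexp (nothing but 1s and 2s), while 2 and 1212 satisfy both and abc satisfies neither.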

Since then, different shells have added different globbing operators. Nowadays, the ksh and zsh globs (and to some extent those of bash -O extglob, which implements a subset of the ksh globs) are functionally equivalent to regexps, with a syntax that is less cumbersome to use with filenames and with the current shell syntax. For instance, in zsh (with the extendedglob option), you can do:

echo a#.txt

if you want (unlikely) to match filenames that consist of a sequence of a characters followed by .txt. That is easier than echo (^a*\.txt$) (here using parentheses as a way to isolate the regexp operators from the shell operators, which is one way shells could have dealt with the problem).

echo (foo|bar|<1-20>).(#i)mpg

matches files with an mpg extension (compared case-insensitively) whose base name is foo, bar or a decimal number from 1 to 20.
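For comparison, bash's extglob option offers ksh-style counterparts for some of these operators. The sketch below only covers repetition and alternation; as far as I know bash has no pattern operator corresponding to zsh's <1-20> numeric range or to an inline (#i) flag. The helper name matches and the variable names are invented for the example.

```bash
shopt -s extglob   # enable ksh-style extended patterns

# Patterns are kept in variables so they are only interpreted at match time.
only_a_txt='*(a).txt'          # zero or more "a", then .txt (zsh: a#.txt)
foo_or_bar='@(foo|bar).mpg'    # exactly foo.mpg or bar.mpg

matches() { [[ $1 == $2 ]]; }  # $2 unquoted: used as a pattern

matches aaa.txt "$only_a_txt" && echo "aaa.txt matches"
matches ab.txt  "$only_a_txt" || echo "ab.txt does not match"
matches bar.mpg "$foo_or_bar" && echo "bar.mpg matches"
```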

ksh93 can now also incorporate regexps (basic, extended, perl-like or "augmented") in its globs (though this feature is quite buggy) and even provides a way to convert between glob and regexp (printf %R, printf %P):

echo ~(Ei:.*\.txt)

matches (non-hidden) txt files using an extended regular expression, case-insensitively.