First, a nitpick: a string like a* in normal shell syntax is a glob, which works differently from a regular expression. At a high level, the shell interpreter (e.g. bash) expands the string a* to a list of every file name matching the pattern a*. These names then become part of the command-line parameters to a single instance of grep (for the programmers, all the expanded words go as separate strings into the argv argument of main). That single grep command then parses the arguments in whatever way it chooses, and it is up to grep to interpret those arguments as file names, options, option arguments, regular expressions, etc., and take the appropriate actions. Everything occurs sequentially (AFAIK no grep implementation uses multiple threads).
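To make the expansion step concrete, here is a minimal sketch that uses printf as a stand-in for grep, so each argument the command receives is printed in angle brackets. The scratch directory and the file names (apple, avocado, banana) are made up for the illustration.

```shell
# Scratch directory with hypothetical file names, so nothing real is touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/apple" "$demo_dir/avocado" "$demo_dir/banana"
cd "$demo_dir" || exit 1

# The shell replaces a* with the matching names before the command starts;
# the command itself never sees the string "a*".
expanded=$(printf '<%s> ' a*)
echo "$expanded"

# Quoting suppresses the expansion, so the command receives the literal a*.
literal=$(printf '<%s> ' 'a*')
echo "$literal"
```

The first echo shows two separate arguments (apple and avocado); the second shows the single unexpanded word, which is what grep would treat as a regular expression or file name if you quoted it.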
If you implement a loop in a shell script to do the same thing, it is almost guaranteed to be slower than the above process, for the following reasons. If you spawn a new grep process for each file, it will almost certainly be slower, because the overhead of process creation is multiplied unnecessarily. If you construct the argument list yourself in the shell script and use a single instance of grep, anything you do in shell will still be slower, because shell commands have to be interpreted (by bash), which adds an extra layer of code, and you'll just be re-implementing what bash was already doing faster internally in compiled code.
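The two approaches can be sketched side by side. This is only an illustration: the file names and the search string needle are invented, and -H (always print the file name) is assumed to be available, as it is in GNU and BSD grep.

```shell
# Scratch directory with hypothetical test files.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'needle\nhay\n' > a1.txt
printf 'hay\nneedle\n' > a2.txt
printf 'needle\n' > b1.txt

# Variant 1: one grep process per file (a fork/exec for every match of a*).
loop_out=$(for f in a*; do grep -H needle "$f"; done)

# Variant 2: the shell expands a* once and starts a single grep process.
single_out=$(grep -H needle a*)

printf '%s\n' "$loop_out"
printf '%s\n' "$single_out"
```

Both variants print the same lines; the difference is only in how many processes are created, which is where the loop loses time as the number of files grows.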
As for writing it yourself in C, you can probably get performance comparable to the process described in the first paragraph fairly easily. However, it's unlikely that you'll achieve enough of a gain over the current grep/bash implementations to justify the time spent, unless you delve into machine-specific optimizations or sacrifice portability. Maybe you could try to come up with an arbitrarily parallelizable version of grep, but even that may not help, as you are more likely to be I/O bound than CPU bound. Glob expansion and grep are already "fast enough" for most "normal" purposes.
The shell expands wildcards in arguments, so most applications don't need to perform any wildcard expansion. However, tar's exclude list does support wildcards, which happen to match the wildcards supported by traditional shells. Beware that there may be slight differences; for example, tar doesn't distinguish * and ** like ksh, bash and zsh can. With tar, * can match any character including /, so for example */.svn excludes a file called .svn at any level of the hierarchy. You can use tar --no-wildcards-match-slash, in which case * doesn't match directory separators.
For example, excluding /home/me/websites/*log.txt excludes /home/me/websites/log.txt, /home/me/websites/foo-log.txt and /home/me/websites/subdir/log.txt. Excluding /home/me/websites/**/vendor excludes /home/me/websites/one/vendor and /home/me/websites/one/two/vendor but not /home/me/websites/vendor. With the --no-wildcards-match-slash option, /home/me/websites/*log.txt does not exclude /home/me/websites/subdir/log.txt and /home/me/websites/**/vendor does not exclude /home/me/websites/one/two/vendor.
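The effect of --no-wildcards-match-slash can be checked on a throwaway tree. This sketch assumes GNU tar (the option is not available in every tar implementation) and uses a relative websites directory with invented file names rather than the /home/me paths above. Note that in GNU tar the option has to come before the --exclude it is meant to affect.

```shell
# Throwaway tree with hypothetical file names.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p websites/subdir
touch websites/log.txt websites/foo-log.txt websites/subdir/log.txt websites/index.html

# Default behaviour: * may match /, so subdir/log.txt is excluded as well.
tar -cf default.tar --exclude='websites/*log.txt' websites
default_list=$(tar -tf default.tar)

# With --no-wildcards-match-slash, * stops at directory separators.
tar -cf noslash.tar --no-wildcards-match-slash --exclude='websites/*log.txt' websites
noslash_list=$(tar -tf noslash.tar)

echo "default:";  printf '%s\n' "$default_list"
echo "no-slash:"; printf '%s\n' "$noslash_list"
```

In the first listing no log.txt file should remain; in the second, websites/subdir/log.txt should survive while the two files directly under websites are still excluded.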
tar … --exclude='/home/www-data/sites/*include' … excludes the files and directories under /home/www-data/sites whose name ends with include. You might get away without the quotes, but not if you write --exclude /home/www-data/sites/*include (because then the shell would expand the wildcards before tar can see them) or if you use a shell that signals an error on non-matching wildcards (e.g. zsh in its default, and recommended, configuration).
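The difference quoting makes can be seen without running tar at all, by printing the arguments a command would receive. The sketch below uses printf as a stand-in for tar and a made-up sites directory; the third case relies on the behaviour of sh and bash in their default configuration, where a pattern that matches nothing is passed through unchanged.

```shell
# Hypothetical directory layout for the illustration.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir sites
touch sites/a.include sites/b.include

# Quoted: the pattern reaches the command intact, as a single argument.
quoted=$(printf '[%s]' --exclude='sites/*include')

# Unquoted, as a separate word: the shell expands it before the command runs.
separate=$(printf '[%s]' --exclude sites/*include)

# Unquoted, joined with =: no file name starts with "--exclude=", so sh and
# bash pass the word through unchanged. This is the "getting away with it"
# case; zsh would report an error here by default.
joined=$(printf '[%s]' --exclude=sites/*include)

printf '%s\n' "$quoted" "$separate" "$joined"
```

In the second case tar would see sites/a.include as the exclude pattern and sites/b.include as a file to archive, which is almost never what was intended.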
The option --exclude-from requires a file name. The file must contain one pattern per line. Do not confuse --exclude (followed by a pattern) and --exclude-from (followed by the name of a file containing patterns).
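As a small illustration of --exclude-from, the following sketch writes two patterns to a file and passes that file to tar. The names patterns.txt, backup.tar and the site tree are invented for the example, and the behaviour described is that of GNU tar.

```shell
# Throwaway tree with hypothetical names.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p site/cache site/src
touch site/cache/tmp.bin site/src/main.c site/debug.log

# One pattern per line; no shell quoting is needed inside the file.
printf '%s\n' '*.log' 'site/cache' > patterns.txt

# --exclude-from takes the NAME of the pattern file, not a pattern.
tar -cf backup.tar --exclude-from=patterns.txt site
listing=$(tar -tf backup.tar)
printf '%s\n' "$listing"
```

The listing should contain site/src/main.c but neither the cache directory nor debug.log.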
Best Answer
Globs are not regular expressions. In general, the shell will try to interpret anything you type on the command line that you don't quote as a glob. Shells are not required to support regular expressions at all (although in reality many of the fancier, more modern ones do, e.g. the =~ regex match operator in the bash [[ construct).
The .??* is a glob. It matches any file name that begins with a literal dot ., followed by any two (not necessarily the same) characters, ??, followed by the regular expression equivalent of [^/]*, i.e. 0 or more characters that are not /.
For the full details of shell pathname expansion (the full name for "globbing"), see the POSIX spec.
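A quick way to see what .??* selects is to try it on a few dot files. In this sketch the file names are invented, and the grep -E line is my own rendering of the same rule as an extended regular expression, included only to show how differently the two notations spell it.

```shell
# Scratch directory with hypothetical names of different lengths.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch .a .ab .abc .bashrc visible

# Glob: a dot, two arbitrary characters, then anything.
glob_matches=$(printf '%s\n' .??*)

# The same rule written as a regular expression and applied to the names.
regex_matches=$(printf '%s\n' .a .ab .abc .bashrc visible | grep -E '^\.[^/]{2}[^/]*$')

printf '%s\n' "$glob_matches"
printf '%s\n' "$regex_matches"
```

Both lists contain .ab, .abc and .bashrc. The name .a is too short, visible does not start with a dot, and the special entries . and .. are skipped for the same length reason, which is the usual motivation for writing .??* instead of .*.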