Bash – How does a shell (bash, for example) expand wildcard patterns


Assume that a directory has 100 files starting with the letter 'a'.

If I run grep <some string> a* from the terminal, how will the shell handle this?

Will it expand the regular expression, get a list of all files starting with a and grep on each one of those sequentially? Or is there some other way?

Assume that I have an array of the above filenames that begin with 'a'. Will it take more or less time if I write a for loop and do the iteration myself, in a shell script or in a C program?

Best Answer

First, a nitpick: a string like a* in normal shell syntax is a glob, which works differently from regular expressions.

At a high level, the shell interpreter (e.g. bash) expands the string a* into a list of every file name matching the pattern a*. These names then become part of the command-line parameters to a single instance of grep (for the programmers: each expanded word goes as a separate string into the argv argument of main). That single grep process then parses the arguments however it chooses; it is up to grep to interpret them as file names, options, option arguments, regular expressions, etc., and to take the appropriate actions. Everything occurs sequentially (AFAIK no grep implementation uses multiple threads).
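To make the expansion step visible, here is a minimal sketch. print_argv is a hypothetical helper standing in for grep, and the file names and scratch directory are made up for the demonstration. It shows that an unquoted pattern reaches the command as separate arguments, while a quoted pattern arrives as one literal string.

```bash
#!/usr/bin/env bash
# print_argv is a hypothetical helper: it reports how many arguments it
# received and prints each one on its own line.
print_argv() {
    echo "argc=$#"
    for arg in "$@"; do
        echo "  ${arg##*/}"    # strip the directory part for readability
    done
}

demo_dir=$(mktemp -d)          # scratch directory, assumed writable
touch "$demo_dir"/apple "$demo_dir"/avocado "$demo_dir"/banana

# Unquoted: the shell replaces the pattern with the matching names
# before print_argv starts, so it receives two separate arguments.
expanded=$(print_argv "$demo_dir"/a*)

# Quoted: no expansion happens, so the pattern arrives as one literal string.
literal=$(print_argv "$demo_dir/a*")

echo "$expanded"
echo "$literal"
rm -rf "$demo_dir"
```

The command itself never sees the pattern in the unquoted case; it only sees the names the shell produced.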

If you implement a loop in a shell script to do the same thing, it is almost guaranteed to be slower than the above process, for the following reasons. If you spawn a new grep process for each file, it will almost certainly be slower, because the overhead of process creation is multiplied unnecessarily. If you construct the argument list yourself in the shell script and use a single instance of grep, it will still be slower, because shell commands have to be interpreted (by bash), which adds an extra layer of code, and you would just be re-implementing what bash already does faster internally in compiled code.
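The two approaches can be put side by side as a rough sketch. The directory, the file names a1 to a5 and the search string needle are assumptions made up for illustration. Both variants find the same files; the difference is that the loop starts one grep process per file. Prefixing either variant with the time keyword is one way to compare them on a larger set of files.

```bash
#!/usr/bin/env bash
work_dir=$(mktemp -d)          # scratch directory, assumed writable
for i in 1 2 3 4 5; do
    echo "needle in file $i" > "$work_dir/a$i"
done

# One grep process: the shell passes every matching name in a single argv.
single=$(grep -l needle "$work_dir"/a* | wc -l | tr -d ' ')

# One grep process per file: same result, but five separate process launches.
looped=0
for f in "$work_dir"/a*; do
    if grep -q needle "$f"; then
        looped=$((looped + 1))
    fi
done

echo "single=$single looped=$looped"
rm -rf "$work_dir"
```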

As for writing it yourself in C, you can probably get performance comparable to the process described above fairly easily. It is unlikely, though, that you will gain enough over the current grep/bash implementations to justify the time spent, at least not without delving into machine-specific optimizations or sacrificing portability. You could try to come up with an arbitrarily parallelizable version of grep, but even that may not help, as you are more likely to be I/O bound than CPU bound. Glob expansion and grep are already "fast enough" for most "normal" purposes.

Related Solutions

Shell Command – Meaning of .??* Wildcard

Globs are not regular expressions. In general, the shell will try to interpret anything you type on the command line that you don't quote as a glob. Shells are not required to support regular expressions at all (although in reality many of the fancier, more modern ones do, e.g. the =~ regex match operator in the bash [[ construct).
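The difference shows up when the same characters are read both ways. This is a small sketch using the bash [[ construct; the variable names and the sample string banana are chosen only for illustration.

```bash
#!/usr/bin/env bash
name="banana"

# Glob: a* means "starts with a", so banana does not match.
if [[ $name == a* ]]; then glob_result=match; else glob_result=no-match; fi

# Regex: a* means "zero or more a characters", which matches any string,
# including the empty match at the very start of banana.
if [[ $name =~ a* ]]; then regex_result=match; else regex_result=no-match; fi

echo "glob: $glob_result, regex: $regex_result"
```

In a glob, * stands for any run of characters by itself; in a regular expression, * only repeats whatever precedes it.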

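The pattern .??* is a glob. It matches any file name that begins with a literal dot (.), followed by any two characters (??, not necessarily the same character twice), followed by * which is equivalent to the regular expression [^/]*, i.e. zero or more characters that are not /. Because at least two characters must follow the dot, it matches hidden files but never . or .. themselves.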
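A quick way to check this is to expand the pattern in a scratch directory. The file names below are assumptions picked to cover each case: one character after the dot, exactly two, more than two, and no leading dot.

```bash
#!/usr/bin/env bash
demo_dir=$(mktemp -d)          # scratch directory, assumed writable
touch "$demo_dir"/.a "$demo_dir"/.ab "$demo_dir"/.abc \
      "$demo_dir"/.bashrc "$demo_dir"/visible

matched=""
for path in "$demo_dir"/.??*; do
    matched="$matched ${path##*/}"   # keep only the file name
done

echo "matched:$matched"
rm -rf "$demo_dir"
```

Here .a is skipped because only one character follows the dot, and visible is skipped because it has no leading dot.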

For the full details of shell pathname expansion (the full name for "globbing"), see the POSIX spec.

Bash – How to iterate a zero padded integer in bash

In bash arithmetic, numbers with leading zeros are treated as octal. To force bash to treat them as decimal, you can add a 10# prefix:

next_number=$(printf %06d "$((10#$current_number + 1))")

Or with bash 3.1 or above, to avoid the forking:

printf -v next_number %06d "$((10#$current_number + 1))"

(Note that this doesn't work for negative numbers, as 10#-010 is seen as 10#0 - 010 in bash, so both $((10#-010)) and $((-10#-010)) expand to -8.)
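Put into a loop, the printf -v form can step a zero-padded counter without forking. This is a sketch that assumes a non-negative starting value; the starting value 000008 and the variable sequence are chosen for illustration, since 08 is exactly the kind of input that is invalid as octal.

```bash
#!/usr/bin/env bash
current_number=000008    # without the 10# prefix, 08 is an invalid octal number
sequence=""

for step in 1 2 3; do
    # 10# forces base 10; %06d restores the zero padding.
    printf -v current_number %06d "$((10#$current_number + 1))"
    sequence="$sequence $current_number"
done

echo "sequence:$sequence"
```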
