Bash – How does a shell (bash, for example) expand wildcard patterns

bash shell wildcards

Assume that a directory has 100 files starting with the letter 'a'.

If I do a grep <some string> a* from the terminal, how will the shell handle this?

Will it expand the regular expression, get a list of all files starting with a and grep on each one of those sequentially? Or is there some other way?

Assume that I have an array of the above filenames that begin with 'a'. Will it take more or less time if I write a for loop and do the iteration myself, in a shell script or a C program?

Best Answer

First, a nitpick: a string like a* in normal shell syntax is a glob, which works differently from regular expressions.
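To make the distinction concrete, here is a minimal sketch. The file names and the scratch directory are made up for the demo. It feeds the same text, a*, to the shell as a glob and to grep as a regular expression:

```bash
#!/usr/bin/env bash
# Scratch directory with made-up file names, used only for this demo.
demo_dir=$(mktemp -d)
touch "$demo_dir/apple" "$demo_dir/avocado" "$demo_dir/banana"

# As a glob, a* means "names that start with a".
glob_matches=$(cd "$demo_dir" && echo a*)

# As a regex, a* means "zero or more 'a' characters", which matches every
# line (an empty match still counts), so grep -c counts all three lines.
regex_count=$(printf '%s\n' apple avocado banana | grep -c 'a*')

echo "glob:  $glob_matches"
echo "regex: $regex_count lines matched"

rm -rf "$demo_dir"
```

The glob selects only the two names beginning with a, while the regex matches all three input lines, including banana.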

At a high level, the shell interpreter (e.g. bash) expands the string a* into a list of every file name matching the pattern a*. These names then become part of the command-line arguments to a single instance of grep (for the programmers: each expanded word goes as a separate string into the argv argument of main). That single grep process then parses the arguments in whatever way it chooses; it is up to grep to interpret them as file names, options, option arguments, regular expressions, etc., and take the appropriate actions. Everything occurs sequentially (AFAIK no grep implementation uses multiple threads).
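One way to see the expansion is to replace grep with a stand-in that only reports how many arguments it received. In the sketch below, count_args is a hypothetical helper (not a standard command) and the file names are invented for the demo. It assumes bash with default options, i.e. nullglob and failglob unset.

```bash
#!/usr/bin/env bash
# Scratch directory with three files starting with "a" and one that does not.
demo_dir=$(mktemp -d)
touch "$demo_dir/a1" "$demo_dir/a2" "$demo_dir/a3" "$demo_dir/b1"

# Hypothetical stand-in for grep: prints the number of arguments it was given.
count_args() { echo "$#"; }

# Unquoted: the shell expands a* before the command runs,
# so the command sees "pattern" plus three file names.
unquoted=$(cd "$demo_dir" && count_args pattern a*)

# Quoted: no expansion, the command sees the literal string a*.
quoted=$(cd "$demo_dir" && count_args pattern 'a*')

# No match: by default the pattern is passed through unchanged as one word.
nomatch=$(cd "$demo_dir" && count_args pattern z*)

echo "unquoted=$unquoted quoted=$quoted nomatch=$nomatch"

rm -rf "$demo_dir"
```

The command itself never sees the wildcard unless the pattern is quoted or matches nothing. In the unquoted case it simply receives one argument per matching file.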

If you implement a loop in a shell script to do the same thing, it is almost guaranteed to be slower than the above process, for the following reasons. If you spawn a new grep process for each file, it will almost certainly be slower, because the overhead of process creation is multiplied unnecessarily. If you construct the argument list yourself in the shell script and use a single instance of grep, it will still be slower, because shell commands have to be interpreted (by bash), which adds an extra layer of code, and you'll just be re-implementing what bash already does faster internally in compiled code.
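The two approaches can be put side by side as a rough sketch. The 100 files and the search string needle are invented for the demo. Both variants find the same files; the difference is that the loop starts 100 grep processes and the glob version starts one. Wrapping each variant in bash's time keyword would show the cost on a given machine; no timings are claimed here.

```bash
#!/usr/bin/env bash
# Create 100 small files whose names start with "a" (demo data only).
demo_dir=$(mktemp -d)
for i in $(seq 1 100); do
  echo "line $i needle" > "$demo_dir/a$i"
done

# Variant 1: one grep process per file (100 process creations).
loop_count=0
for f in "$demo_dir"/a*; do
  if grep -q needle "$f"; then
    loop_count=$((loop_count + 1))
  fi
done

# Variant 2: the shell expands the glob once and a single grep gets all names.
single_count=$(grep -l needle "$demo_dir"/a* | wc -l | tr -d ' ')

echo "loop: $loop_count files, single grep: $single_count files"

rm -rf "$demo_dir"
```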

As for writing it yourself in C, you can probably get performance comparable to the process described above fairly easily. It is unlikely, though, that you would gain enough over the current grep/bash implementations to justify the time spent, at least not without delving into machine-specific optimizations or sacrificing portability. Maybe you could try to come up with an arbitrarily parallelizable version of grep, but even that may not help, as you are more likely to be I/O bound than CPU bound. Glob expansion and grep are already "fast enough" for most "normal" purposes.
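For anyone who wants to experiment with the parallel idea without writing C, a rough approximation is to let xargs run several grep processes at once. This is only a sketch under assumptions: it relies on the -0 and -P options, which GNU and BSD xargs provide but POSIX does not require, and the batch size (10) and job count (4) are arbitrary. The order of the output lines is not guaranteed, and for I/O-bound searches it may be no faster than a single grep.

```bash
#!/usr/bin/env bash
# Create 100 small files whose names start with "a" (demo data only).
demo_dir=$(mktemp -d)
for i in $(seq 1 100); do
  echo "line $i needle" > "$demo_dir/a$i"
done

# The shell still expands the glob; printf emits the names NUL-separated,
# and xargs hands them to up to 4 concurrent grep processes, 10 files each.
parallel_count=$(
  printf '%s\0' "$demo_dir"/a* |
    xargs -0 -n 10 -P 4 grep -l needle |
    wc -l | tr -d ' '
)

echo "parallel grep found $parallel_count files"

rm -rf "$demo_dir"
```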