Shell – Glob files *without* using a shell

shell wildcards

I want to list files in a certain subdirectory, but I'm doing so as part of a docker exec inside a docker container, so I don't want to bother starting up a shell that I don't really need. Is it possible to find all the matches for a glob with a simple command line tool, and not just a shell?

For example, my current invocation is bash -l -c 'echo /usr/local/conda-meta/*.json'. Is it possible to simplify this using a commonly available tool, resulting in something like globber /usr/local/conda-meta/*.json, which would be much simpler and lighter weight?

Best Answer

sh is simple and commonly available. sh is the tool that is invoked to parse command lines in things like system(cmdline) in many languages. Many OSes including some GNU ones have stopped using bash (the GNU shell) to implement sh for the reason that it has become too bloated to do just that simple thing of parsing command lines and interpreting POSIX sh scripts.

Your bash -l -c 'echo /usr/local/conda-meta/*.json' command line may already be interpreted by a sh invocation. If so, you can just do:

printf '%s\n' /usr/local/conda-meta/*.json

directly. If not:

sh -c 'printf "%s\n" /usr/local/conda-meta/*.json'
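If the directory comes from a variable rather than being typed literally, one option is to hand it to sh as a positional argument instead of splicing it into the code string, so that special characters in the path are not re-parsed as shell syntax. A minimal sketch, where list_json is a hypothetical helper name and not something from the question:

```shell
# list_json DIR: print the *.json entries of DIR, one per line.
# (list_json is a hypothetical name used for illustration.)
# DIR is passed to the inner sh as $1, so it is never parsed as code.
# The extra "sh" word becomes the inner shell's $0.
list_json() {
  sh -c 'printf "%s\n" "$1"/*.json' sh "$1"
}
```

It would be called as list_json /usr/local/conda-meta. The same sh -c '...' sh DIR argument list should also work when given to docker exec after the container name, since the arguments are then passed to sh as they are.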

You could also use find here. find doesn't do globbing but it can report file names that match patterns similar to shell ones.

LC_ALL=C find /usr/local/conda-meta/. ! -name . -prune -name '*.json'

Or with some find implementations:

LC_ALL=C find /usr/local/conda-meta -mindepth 1 -maxdepth 1 -name '*.json'

(Note that LC_ALL=C is needed here so that * matches any sequence of bytes, not just sequences that form valid characters in the current locale. The LC_ALL=C prefix is a shell construct, so if that command line is not interpreted by a shell, you may need to change it to env LC_ALL=C find...)

Some differences with shell globs:

  • the list of files is not sorted
  • hidden files are included (you could add a ! -name '.*' to exclude them)
  • you get no output if there's no matching file. Globs have the misfeature that the pattern is left as-is, unexpanded, in that case.
  • with the first (standard) variant, files will be output as /usr/local/conda-meta/./file.json.
  • some globs such as x*/y/../*z are not easily translated (also note the differing behaviour with respect to symlinks to directories in that case).
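Two of the differences above can be narrowed from either side. The sketch below is only an illustration: json_glob and json_find are hypothetical helper names. The first makes the sh glob print nothing when there is no match, the second makes find skip hidden files and sort its output by byte value.

```shell
# json_glob DIR: like DIR/*.json, but prints nothing when nothing matches.
# (hypothetical helper name)
json_glob() {
  set -- "$1"/*.json
  # An unmatched pattern is left unexpanded, so check that the first word
  # really exists; -L also accepts a broken symlink.
  if [ -e "$1" ] || [ -L "$1" ]; then
    printf '%s\n' "$@"
  fi
}

# json_find DIR: find-based variant, hidden files excluded, output sorted.
# (hypothetical helper name)
# Paths come out as DIR/./file.json, as with the standard variant above.
json_find() {
  LC_ALL=C find "$1/." ! -name . -prune ! -name '.*' -name '*.json' |
    LC_ALL=C sort
}
```

The sort step works on lines, so it is only meaningful as long as no file name contains a newline character.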

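In any case, you can't use echo to output arbitrary data: depending on the implementation, it may interpret backslash sequences or treat a leading argument such as -n as an option, which is why printf is used above.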

My next question would be: what are you going to do with that output? With echo, you're outputting those file paths separated by SPC characters, and with my printf or find above, delimited by NL characters. Both NL and SPC are perfectly valid characters in file names, so those outputs are not reliably post-processable. You could use '%s\0' instead of '%s\n' (or use find's -print0 if supported): the result is not suitable for display to a user, but it is post-processable.
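As a rough sketch of the NUL-delimited approach, assuming the consumer can read NUL-terminated records (json_glob0 is a hypothetical helper name):

```shell
# json_glob0 DIR: print the *.json entries of DIR, each terminated by NUL,
# so that names containing spaces or newlines survive post-processing.
# (hypothetical helper name)
json_glob0() {
  sh -c 'printf "%s\0" "$1"/*.json' sh "$1"
}
```

On systems whose xargs supports the -0 option, the output could then be consumed with something like json_glob0 /usr/local/conda-meta | xargs -0 ls -ld --. Note that if nothing matches, the unexpanded pattern is still what gets printed.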

In terms of efficiency, here is a comparison of Ubuntu 20.04's /bin/sh (dash 0.5.10.2) with its find (GNU find 4.7.0). The timings below were taken from zsh (repeat and TIMEFMT are zsh features).

Startup time:

$ time (repeat 1000 sh -c '')
( repeat 1000; do; sh -c ''; done; )  0.91s user 0.66s system 105% cpu 1.483 total
$ time (repeat 1000 find . -quit)
( repeat 1000; do; find . -quit; done; )  1.35s user 1.25s system 103% cpu 2.507 total

Globbing some json files:

$ TIMEFMT='%U user %S system %P cpu %*E total'
$ time (repeat 1000 sh -c 'printf "%s\n" /usr/share/iso-codes/json/*.json') > /dev/null
0.95s user 0.72s system 105% cpu 1.587 total
$ time (repeat 1000  find /usr/share/iso-codes/json -mindepth 1 -maxdepth 1 -name '*.json') > /dev/null
1.34s user 1.35s system 103% cpu 2.599 total

Even bash is hardly slower than find here:

$ time (repeat 1000 bash -c 'printf "%s\n" /usr/share/iso-codes/json/*.json') > /dev/null
1.53s user 1.36s system 102% cpu 2.808 total

Of course YMMV depending on the system, implementation, version of the respective utilities and the libraries they're linked against.
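The repeat keyword used in those timings is specific to zsh. To rerun the comparison from another shell, a small loop can stand in for it. This is only a sketch, and run_n is a hypothetical helper name:

```shell
# run_n COUNT CMD [ARG...]: run CMD COUNT times, roughly like zsh's
# "repeat COUNT CMD". (hypothetical helper name)
# Note: POSIX sh has no local variables, so this overwrites any global $n.
run_n() {
  n=$1
  shift
  while [ "$n" -gt 0 ]; do
    "$@"
    n=$((n - 1))
  done
}
```

In a shell where time is a keyword that accepts compound commands, such as bash, this could be used as time (run_n 1000 sh -c ''). The loop itself adds a little overhead, so the numbers will not be directly comparable with the zsh ones above.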

Now, on a historical note, the name glob actually comes from a utility called glob in the very first versions of Unix in the early 70s. It was located in /etc and was invoked by sh as a helper to expand wildcard patterns.

You'll find a few projects online to revive that very old shell such as https://etsh.nl/. More as an exercise in archaeology, you could build the glob utility from there and then be able to do:

glob printf '%s\n' '/usr/local/conda-meta/*.json'

A few notes of warning though.

  • those are ancient globs, [!x] (let alone [^x]) is not supported.
  • it's not 8 bit safe. Actually, the 8th bit is used for escaping the glob operators ($'\xe9*' would match the same thing as i*, $'\xaa*' would match on filenames that start with *; the shell would set that 8th bit for the quoted characters before invoking glob)
  • ranges like [a-f] match on byte value rather than collation order (in practice, that's generally an advantage IMO).
  • Non-matching globs result in a No match error (again, probably preferable; that's something that was broken by the Bourne shell in the late 70s).

The glob functionality was later moved into the shell, starting with the PWB shell and Bourne shell in the late 70s. Later, fnmatch() and glob() functions were added to the C library to allow that feature to be used from other applications, but I'm not aware of a standard or common utility that is a bare interface to those functions. Even perl used to invoke csh in its early days to expand glob patterns.
