This question is prompted by a short script I found in a Linux magazine. As evidence that I didn't make this up, here's a picture of it:
I would like to write a letter to the editor of this publication about what's wrong with this and how to write it better.
The script attempts to capture jpeg files into a variable, so that something (compression using lepton
) can be done with them.
for jpeg in `echo "$(file $(find ./ ) |
grep JPEG | cut -f 1 -d ':')"`
do
/path/to/command "$jpeg"
...
Apparently in this instance we can't trust the files to be named with a .jpg
extension so we can't catch them with something like
for f in *.JPG *.jpg *.JPEG *.jpeg ; do ...
because the writer has used file
to check their type, but if the filenames can't be trusted to have a sensible extension, then I don't see how we can trust them not to be -rf *
or (; \ $!|
or have newlines or whatever else.
How can I sanely capture files into a variable by type with for
or while
, or perhaps avoid doing so by using find
with -exec
, or some other method?
Bonus for insights into and demonstrations of what's wrong with the code in the picture.
I've tagged this question with [bash] since it's about a bash script, but if you feel like answering with a way to do this that doesn't use bash, then please feel free to do that.
Best Answer
Code first:
Let's do this with Bash's special globs and a
for
loop:Explanation:
First of all, we need to make the Bash globs more useful by enabling the
globstar
anddotglob
shell options. Here is their description fromman bash
in the SHELL BUILTIN COMMANDS section aboutshopt
:Then we use this new "recursive glob"
./**
in afor
loop to iterate over all files and folders inside the current directory and all its subdirectories. Please always use absolute paths or explicit relative paths starting with a./
or../
in your globs, not just**
, to prevent problems with special file names like~
.Now we test each file (and folder) name with the
file
command for its contents. The-b
option prevents it from printing the file name again before the content information string, which makes filtering more safe.Now we know that the content information of all valid JPG/JPEG files must start with
JPEG image data,
, which is what we test the output offile
for withgrep
. We use the-q
option to suppress any output, as we are only interested ingrep
's exit code, which indicates if the pattern matched or not.If it matched, the code inside the
if
/then
block will be executed. We can do anything we want in here. The current JPEG filename is available in the shell variable$f
. We just have to make sure to always put it in double quotes to prevent the accidental evaluation of filenames with special characters like spaces, newlines, or symbols. It is also usually best to separate it from other arguments by placing it after--
, which causes most commands to interpret it as a filename even if it's something like-v
or--help
that would otherwise be interpreted as an option.Bonus question:
Time to blow up some code, for science! Here is the version from your question/book:
First of all, allow me to mention how complex they wrote it. We have 4 levels of nested subshells, using mixed command substitution syntaxes (
``
and$()
), which are just necessary because of the incorrect/suboptimal usage offind
.Here
find
just lists all files and prints their names, one per line. Then the full output is passed tofile
to examine each of them. But wait! One file name per line? What about file names containing newlines? Right, those will break it!Actually even simple spaces break it too, because those are treated as separators as well by
file
. You can't even quote the"$(find ./ )"
here as a remedy, because that would then quote the whole multi-line output as one single filename argument.Next step, the
file
output gets scanned withgrep JPEG
. Don't you think it's a bit easy to trick such a simple pattern, especially as the output of plainfile
always contains the file name as well? Basically everything with "JPEG" in its file name will trigger a match, no matter what it contains.Okay, so we have the
file
output of all JPEG files (or those who pretend to be one), now they process all lines withcut
to extract the original file name from the first column, separated by a colon... Guess what, let's try this on a file with a colon in its name:So to conclude, the approach from your book works, but only if all files it checks do not contain any spaces, newlines, colons and probably other special characters and do not contain the string "JPEG" anywhere in their filenames. It is also kind of ugly, but as beauty lies in the eye of the beholder, I'm not going to ramble about that.