Ubuntu – How to operate on all files of a certain type if they might not have the right extension

bashcommand linescripts

This question is prompted by a short script I found in a Linux magazine. As evidence that I didn't make this up, here's a picture of it:

quite awful code sample

I would like to write a letter to the editor of this publication about what's wrong with this and how to write it better.

The script attempts to capture jpeg files into a variable, so that something (compression using lepton) can be done with them.

for jpeg in `echo "$(file $(find ./ ) |
   grep JPEG | cut -f 1 -d ':')"`
  do
     /path/to/command "$jpeg"
...

Apparently in this instance we can't trust the files to be named with a .jpg extension so we can't catch them with something like

for f in *.JPG *.jpg *.JPEG *.jpeg ; do ...

because the writer has used file to check their type, but if the filenames can't be trusted to have a sensible extension, then I don't see how we can trust them not to be -rf * or (; \ $!| or have newlines or whatever else.

How can I sanely capture files into a variable by type with for or while, or perhaps avoid doing so by using find with -exec, or some other method?

Bonus for insights into and demonstrations of what's wrong with the code in the picture.

I've tagged this question with [bash] since it's about a bash script, but if you feel like answering with a way to do this that doesn't use bash, then please feel free to do that.

Best Answer

Code first:

Let's do this with Bash's special globs and a for loop:

#!/bin/bash
shopt -s globstar dotglob

for f in ./** ; do 
    if file -b -- "$f" | grep -q '^JPEG image data,' ; then 

        # do whatever you want with the JPEG file "$f" in here:
        md5sum -- "$f"

    fi
done

Explanation:

First of all, we need to make the Bash globs more useful by enabling the globstar and dotglob shell options. Here is their description from man bash in the SHELL BUILTIN COMMANDS section about shopt:

 dotglob 
    If set, bash includes filenames beginning with a `.' in the results of 
    pathname expansion.
 globstar
    If set, the pattern ** used in a pathname expansion context will match 
    all files and zero or more directories and subdirectories. If the pattern
    is followed by a /, only directories and subdirectories match.

Then we use this new "recursive glob" ./** in a for loop to iterate over all files and folders inside the current directory and all its subdirectories. Please always use absolute paths or explicit relative paths starting with a ./ or ../ in your globs, not just **, to prevent problems with special file names like ~.

Now we test each file (and folder) name with the file command for its contents. The -b option prevents it from printing the file name again before the content information string, which makes filtering more safe.

Now we know that the content information of all valid JPG/JPEG files must start with JPEG image data,, which is what we test the output of file for with grep. We use the -q option to suppress any output, as we are only interested in grep's exit code, which indicates if the pattern matched or not.

If it matched, the code inside the if/then block will be executed. We can do anything we want in here. The current JPEG filename is available in the shell variable $f. We just have to make sure to always put it in double quotes to prevent the accidental evaluation of filenames with special characters like spaces, newlines, or symbols. It is also usually best to separate it from other arguments by placing it after --, which causes most commands to interpret it as a filename even if it's something like -v or --help that would otherwise be interpreted as an option.


Bonus question:

Time to blow up some code, for science! Here is the version from your question/book:

for jpeg in `echo "$(file $(find ./ ) 
    | grep JPEG | cut -f 1 -d ':')"`
do
     /path/to/command "$jpeg"
done

First of all, allow me to mention how complex they wrote it. We have 4 levels of nested subshells, using mixed command substitution syntaxes (`` and $()), which are just necessary because of the incorrect/suboptimal usage of find.

Here find just lists all files and prints their names, one per line. Then the full output is passed to file to examine each of them. But wait! One file name per line? What about file names containing newlines? Right, those will break it!

$ ls --escape ne*ne
new\nline
$ file $(find . -name 'ne*ne' )
./new: cannot open `./new' (No such file or directory)
line:  cannot open `line' (No such file or directory)

Actually even simple spaces break it too, because those are treated as separators as well by file. You can't even quote the "$(find ./ )" here as a remedy, because that would then quote the whole multi-line output as one single filename argument.

$ ls simple*
simple spaces.jpg
$ file $(find ./ -name 'simple*')
./simple:   cannot open `./simple' (No such file or directory)
spaces.jpg: cannot open `spaces.jpg' (No such file or directory)

Next step, the file output gets scanned with grep JPEG. Don't you think it's a bit easy to trick such a simple pattern, especially as the output of plain file always contains the file name as well? Basically everything with "JPEG" in its file name will trigger a match, no matter what it contains.

$ echo "to be or not to be" > IAmNoJPEG.txt
$ file IAmNoJPEG.txt | grep JPEG
IAmNoJPEG.txt: ASCII text

Okay, so we have the file output of all JPEG files (or those who pretend to be one), now they process all lines with cut to extract the original file name from the first column, separated by a colon... Guess what, let's try this on a file with a colon in its name:

$ ls colon*
colons:evil.jpeg
$ file colon* | grep JPEG | cut -f 1 -d ':'
colons

So to conclude, the approach from your book works, but only if all files it checks do not contain any spaces, newlines, colons and probably other special characters and do not contain the string "JPEG" anywhere in their filenames. It is also kind of ugly, but as beauty lies in the eye of the beholder, I'm not going to ramble about that.