Ubuntu – How to operate on all files of a certain type if they might not have the right extension

bash, command-line, scripts

This question is prompted by a short script I found in a Linux magazine. As evidence that I didn't make this up, here's a picture of it:

I would like to write a letter to the editor of this publication about what's wrong with this and how to write it better.

The script attempts to capture jpeg files into a variable, so that something (compression using lepton) can be done with them.

for jpeg in `echo "$(file $(find ./ ) |
   grep JPEG | cut -f 1 -d ':')"`
  do
     /path/to/command "$jpeg"
...

Apparently in this instance we can't trust the files to be named with a .jpg extension so we can't catch them with something like

for f in *.JPG *.jpg *.JPEG *.jpeg ; do ...

because the writer has used file to check their type, but if the filenames can't be trusted to have a sensible extension, then I don't see how we can trust them not to be -rf * or (; \ $!| or have newlines or whatever else.

How can I sanely capture files into a variable by type with for or while, or perhaps avoid doing so by using find with -exec, or some other method?

Bonus for insights into and demonstrations of what's wrong with the code in the picture.

I've tagged this question with [bash] since it's about a bash script, but if you feel like answering with a way to do this that doesn't use bash, then please feel free to do that.

Best Answer

Code first:

Let's do this with Bash's special globs and a for loop:

#!/bin/bash
shopt -s globstar dotglob

for f in ./** ; do 
    if file -b -- "$f" | grep -q '^JPEG image data,' ; then 

        # do whatever you want with the JPEG file "$f" in here:
        md5sum -- "$f"

    fi
done

Explanation:

First of all, we need to make the Bash globs more useful by enabling the globstar and dotglob shell options. Here is their description from man bash in the SHELL BUILTIN COMMANDS section about shopt:

 dotglob 
    If set, bash includes filenames beginning with a `.' in the results of 
    pathname expansion.
 globstar
    If set, the pattern ** used in a pathname expansion context will match 
    all files and zero or more directories and subdirectories. If the pattern
    is followed by a /, only directories and subdirectories match.

Then we use this new "recursive glob" ./** in a for loop to iterate over all files and folders inside the current directory and all its subdirectories. Please always use absolute paths or explicit relative paths starting with a ./ or ../ in your globs, not just **, to prevent problems with special file names like ~ or names that begin with a dash.

Now we use the file command to check the contents of each file (and folder). The -b option prevents it from printing the file name again before the content information string, which makes filtering safer.

We know that the content information of all valid JPG/JPEG files must start with the string "JPEG image data,", so that is what we test the output of file for with grep. We use the -q option to suppress any output, as we are only interested in grep's exit code, which indicates whether the pattern matched or not.

If it matched, the code inside the if/then block will be executed. We can do anything we want in here. The current JPEG filename is available in the shell variable $f. We just have to make sure to always put it in double quotes to prevent the accidental evaluation of filenames with special characters like spaces, newlines, or symbols. It is also usually best to separate it from other arguments by placing it after --, which causes most commands to interpret it as a filename even if it's something like -v or --help that would otherwise be interpreted as an option.
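
The question also asks about find with -exec. The following is a minimal sketch of that route, not part of the loop above. It assumes a file version that supports --mime-type and reports JPEGs as image/jpeg, and the function name jpeg_md5sums is made up for illustration. find hands the names to an inline sh script as arguments, so no file name is ever parsed out of text output.

```shell
# Hypothetical helper: run md5sum on every regular file below a directory
# whose content (not its name) is identified as JPEG.
jpeg_md5sums() {
    find "${1:-.}" -type f -exec sh -c '
        for f do
            # -b omits the file name; --mime-type prints e.g. "image/jpeg"
            if [ "$(file -b --mime-type -- "$f")" = "image/jpeg" ]; then
                md5sum -- "$f"
            fi
        done
    ' sh {} +
}
```

Calling jpeg_md5sums . processes the current directory tree. Replace md5sum with whatever command you need.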

Bonus question:

Time to blow up some code, for science! Here is the version from your question:

for jpeg in `echo "$(file $(find ./ ) |
    grep JPEG | cut -f 1 -d ':')"`
do
     /path/to/command "$jpeg"
done

First of all, allow me to mention how convoluted this is. We have three levels of nested command substitution, written in two different syntaxes (`` and $()), and they are only necessary because of the incorrect/suboptimal usage of find.

Here find just lists all files and prints their names, one per line. Then the full output is passed to file to examine each of them. But wait! One file name per line? What about file names containing newlines? Right, those will break it!

$ ls --escape ne*ne
new\nline
$ file $(find . -name 'ne*ne' )
./new: cannot open `./new' (No such file or directory)
line:  cannot open `line' (No such file or directory)

Actually, even simple spaces break it, because the shell splits the unquoted output of $(find ./ ) on spaces as well as newlines before file ever sees the names. You can't even quote the "$(find ./ )" here as a remedy, because that would pass the whole multi-line output as one single filename argument.

$ ls simple*
simple spaces.jpg
$ file $(find ./ -name 'simple*')
./simple:   cannot open `./simple' (No such file or directory)
spaces.jpg: cannot open `spaces.jpg' (No such file or directory)
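
The splitting in both transcripts is done by the shell before the command runs. Here is a small sketch that makes this visible without involving file at all. The helper count_args is a made-up name that just reports how many arguments it received.

```shell
# Hypothetical helper: print the number of arguments received.
count_args() { echo "$#"; }

name='simple spaces.jpg'

unquoted=$(count_args $name)    # word splitting turns one name into two arguments
quoted=$(count_args "$name")    # quoting keeps the name as a single argument

echo "unquoted: $unquoted"      # prints "unquoted: 2"
echo "quoted:   $quoted"        # prints "quoted:   1"
```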

Next step, the file output gets scanned with grep JPEG. Don't you think it's a bit easy to trick such a simple pattern, especially as the output of plain file always contains the file name as well? Basically everything with "JPEG" in its file name will trigger a match, no matter what it contains.

$ echo "to be or not to be" > IAmNoJPEG.txt
$ file IAmNoJPEG.txt | grep JPEG
IAmNoJPEG.txt: ASCII text

Okay, so we have the file output of all JPEG files (or those that pretend to be one). Next, all lines are processed with cut to extract the original file name from the first column, separated by a colon... Guess what, let's try this on a file with a colon in its name:

$ ls colon*
colons:evil.jpeg
$ file colon* | grep JPEG | cut -f 1 -d ':'
colons

So to conclude, the approach from your magazine works, but only if none of the files it checks contain spaces, newlines, colons, or probably other special characters in their names, and none of them have the string "JPEG" anywhere in their names. It is also kind of ugly, but as beauty lies in the eye of the beholder, I'm not going to ramble about that.

Related Solutions

Ubuntu – What’s the difference between set, export and env and when should I use each

Let us consider a specific example. The grep command uses an environment variable called GREP_OPTIONS to set default options. (GNU grep 3.6 and later ignore this variable, so the examples below assume an older grep.)

Now, given that the file test.txt contains the following lines:

line one
line two

running the command grep one test.txt will return

line one

If you run grep with the -v option, it will return the non-matching lines, so the output will be

line two

We will now try to set the option with an environment variable.

Variables set without export are not inherited in the environment of the commands you are calling.
```
GREP_OPTIONS='-v'
grep one test.txt
```
The result:
```
line one
```
Obviously, the option -v did not get passed to grep.

You want to use this form when you are setting a variable only for the shell to use. For example, in for i in * ; do you do not want to export $i.

However, if you put the assignment on the same line in front of the command, the variable is passed on to the environment of that particular command, so you can do
```
GREP_OPTIONS='-v' grep one test.txt
```
which will return the expected
```
line two
```
You use this form to temporarily change the environment of this particular instance of the program launched.

Exporting a variable causes the variable to be inherited:
```
export GREP_OPTIONS='-v'
grep one test.txt
```
returns now
```
line two
```
This is the most common way of setting variables for use by processes subsequently started from the shell.

This was all done in bash. export is a bash builtin; VAR=whatever is bash syntax. env, on the other hand, is a program in itself. When env is called, the following things happen:
1. The command env gets executed as a new process
2. env modifies the environment, and
3. calls the command that was provided as an argument. The env process is replaced by the command process.
Example:
```
env GREP_OPTIONS='-v' grep one test.txt
```
This command runs two programs in turn: (i) env and then (ii) grep. Because grep replaces env via exec, only one new process is actually created. From the point of view of the grep process, the result is exactly the same as running
```
GREP_OPTIONS='-v' grep one test.txt
```
However, you can use this idiom if you are outside of bash or don't want to launch another shell (for example, when you are using the exec() family of functions rather than the system() call).
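
The four forms discussed above can be compared side by side. This is a minimal sketch that uses a made-up variable, DEMO_VAR, instead of GREP_OPTIONS, and a child sh process that simply reports what it inherited.

```shell
# DEMO_VAR is a hypothetical variable name used only for this illustration.
# The child process prints the value it inherited, or "unset".
child='echo "${DEMO_VAR:-unset}"'

unset DEMO_VAR
DEMO_VAR=shell-only

plain=$(sh -c "$child")                          # not exported: child sees nothing
prefixed=$(DEMO_VAR=one-shot sh -c "$child")     # set for this one command only
via_env=$(env DEMO_VAR=from-env sh -c "$child")  # same effect, done by the env program

export DEMO_VAR                                  # from now on every child inherits it
exported=$(sh -c "$child")

printf '%s\n' "$plain" "$prefixed" "$via_env" "$exported"
# prints:
# unset
# one-shot
# from-env
# shell-only
```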

Additional note on `#!/usr/bin/env`

This is also why the idiom #!/usr/bin/env interpreter is used rather than #!/usr/bin/interpreter. env does not require a full path to a program, because it uses the execvp() function which searches through the PATH variable just like a shell does, and then replaces itself by the command run. Thus, it can be used to find out where an interpreter (like perl or python) "sits" on the path.

It also means that by modifying the current path you can influence which python variant will be called. This makes the following possible:

echo -e '#!/usr/bin/bash\n\necho I am an evil interpreter!' > python
chmod a+x ./python
export PATH=.
python

Now, instead of running Python, the last command will result in

I am an evil interpreter!
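
A gentler variant of the same experiment is sketched below. It changes PATH for a single env call only, so the rest of the session keeps working. The interpreter name myinterp and the temporary directory are assumptions made for the illustration.

```shell
# Create a throwaway directory holding a fake "interpreter" (hypothetical name).
tmp=$(mktemp -d)
printf '%s\n' '#!/bin/sh' 'echo "fake interpreter ran"' > "$tmp/myinterp"
chmod +x "$tmp/myinterp"

# env searches PATH for myinterp, just as it does for "#!/usr/bin/env myinterp".
# The modified PATH applies to this one command only.
result=$(PATH="$tmp:$PATH" env myinterp)
echo "$result"    # prints "fake interpreter ran"

rm -rf "$tmp"
```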

Ubuntu – How to run a bash script without affecting the file system

Try using an overlay, with a chroot. First, decide the path you want to chroot to, and make sure it exists, and similarly for the path you will overlay on / (which is where modifications will go):

mkdir -p /chroot
mkdir -p /tmp/tmproot

I chose a directory in /tmp/ as it's a tmpfs on my system (possibly unadvisable, but OK for me), so no changes should reach the disk. You can use a squashfs and mount it somewhere, and use that as the overlay, but that has the problem of being read-only, I think.

Now:

$ mount -t overlayfs -o lowerdir=/,upperdir=/tmp/tmproot overlayfs /chroot/
$ chroot /chroot/ /bin/bash -l
root:/$ touch test
root:/$ ls
...  sys  test  tmp  ...
root:/$ logout
$ ls /
...  sys  tmp  ...
$ ls /tmp/tmproot/
root  test

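If you make the upperdir independent of a physical disk (perhaps by using tmpfs), this should protect the lowerdir. On kernels 3.18 and later, the filesystem type is called overlay rather than overlayfs, and the mount also needs a workdir= option pointing to an empty directory on the same filesystem as upperdir.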

Note the creation of a root folder - that's for my .bash_history. A copy was made of the original .bash_history, and then appended to.