Shell – Recursively compare directory contents by name, ignoring file extensions

Tags: diff, directory, filenames, shell-script

I have a directory containing about 7,000 music files. I used lame to recursively re-encode all files in it to a separate directory, outputting all files with the same relative path and file name. The output files have a .mp3 extension, but some of the input files had different extensions (.wma, .aac, etc).

The output directory has about 100 fewer files than the source. I want to compare the two directories and get a list of the files that exist in the source but not in the destination. This would be simple enough, except that I need to ignore differences in file extension.

I've tried rsync with dry-run turned on, but I couldn't figure out a way to ignore file extensions. I've also tried diff, but couldn't find an option to compare by name while ignoring file extensions. I started thinking I could do a recursive ls on both directories, remove the file extensions, and compare the outputs, but I have no idea where to start with modifying the ls output using sed or awk.

Best Answer

To see a listing, here are two variants: one that recurses into subdirectories and one that doesn't. Both use process substitution, <(…), which is specific to bash, ksh and zsh.

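# Recursive. "find ." is portable; [^./] keeps sed from eating a path component
# when a file has no extension.
comm -3 <(cd source && find . -type f | sed 's/\.[^./]*$//' | sort) \
        <(cd dest && find . -type f | sed 's/\.[^./]*$//' | sort)
# Top level only.
comm -3 <(cd source && for x in *; do printf '%s\n' "${x%.*}"; done | sort) \
        <(cd dest && for x in *; do printf '%s\n' "${x%.*}"; done | sort)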

Shorter, in zsh:

comm -3 <(cd source && print -lr **/*(:r)) <(cd dest && print -lr **/*(:r))
comm -3 <(print -lr source/*(:t:r)) <(print -lr dest/*(:t:r))

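The comm command lists the lines that are common to two files (comm -12), that are only in the first file (comm -23) or that are only in the second file (comm -13). The numbers indicate what is subtracted from the output¹. With comm -3, as used above, lines found only in the first file are printed unindented and lines found only in the second file are indented by a tab; use comm -23 to list only what is missing from dest. The two input files must be sorted.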
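
As a small illustration of the three modes, here is a sketch that runs comm on two made-up sorted lists. The temporary files and the album/… names are invented for the example and stand in for the extension-stripped listings of source and dest.

```shell
# Two small, already sorted lists (hypothetical data for illustration).
tmp=$(mktemp -d)
printf '%s\n' album/a album/b album/c > "$tmp/source.txt"
printf '%s\n' album/a album/c album/d > "$tmp/dest.txt"

# -23 drops columns 2 and 3, leaving the lines found only in the first file.
only_in_source=$(comm -23 "$tmp/source.txt" "$tmp/dest.txt")
# -13 drops columns 1 and 3, leaving the lines found only in the second file.
only_in_dest=$(comm -13 "$tmp/source.txt" "$tmp/dest.txt")
# -12 drops columns 1 and 2, leaving the lines common to both files.
in_both=$(comm -12 "$tmp/source.txt" "$tmp/dest.txt")

printf 'only in source: %s\n' "$only_in_source"
printf 'only in dest:   %s\n' "$only_in_dest"
printf 'in both:\n%s\n' "$in_both"

rm -rf "$tmp"
```

For the question as asked (files in the source that are missing from the destination), the comm -23 form is the one that gives the list directly.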

Here, the files are in fact the output of a command. The shell evaluates the <(…) construct by providing a “fake” file (a FIFO or a /dev/fd/ named file descriptor) as the argument to the command.

¹ So here the minus sayers are fully justified.


If you want to perform actions on the files, you'll probably want to iterate over the source files.

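cd source
for x in *; do
  # Expand to every file in dest with the same base name, whatever its extension.
  set -- "…/dest/${x%.*}".*
  if [ $# -eq 1 ] && ! [ -e "$1" ]; then
    # The pattern matched nothing, so the shell left it unexpanded.
    echo "$x has not been converted"
  elif [ $# -gt 1 ]; then
    echo "$x has been converted to more than one output file:" "$@"
  else
    echo "$x has been converted to $1"
  fi
done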
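
The loop above only looks at the top level of source. A recursive variant could be sketched roughly as follows. The function name missing_in_dest and its two arguments are made up for this example, and it assumes that every converted file in the destination has some extension and that no file name contains a newline.

```shell
# missing_in_dest SRC DST (hypothetical helper, not part of the original answer)
# Prints each file under SRC that has no file with the same relative path
# and base name (any extension) under DST.
missing_in_dest() {
  src=$1
  dst=$2
  (cd "$src" && find . -type f) | sort | while IFS= read -r f; do
    f=${f#./}                       # drop the leading "./" that find adds
    case ${f##*/} in
      *.*) stem=${f%.*} ;;          # strip the extension from the file name
      *)   stem=$f ;;               # no extension: keep the path as-is
    esac
    # If nothing matches, the pattern stays unexpanded and -e fails.
    set -- "$dst/$stem".*
    [ -e "$1" ] || printf '%s\n' "$f"
  done
}
```

It would be called as missing_in_dest source dest, and prints one relative path per unconverted file. Iterating with find rather than a glob keeps the sketch usable in plain sh, at the cost of the newline restriction mentioned above.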