Bash – Rename a lot of files with all types of characters, with POSIX portability

bash, filenames, rename, shell

Sometimes I need to rename all the files in a directory (the renaming convention follows later), where each filename has the form 'filename.extension' (the extension always exists and varies). The name may contain whitespace and any characters from the [:graph:] class. My first problem is that the solution should be absolutely portable between *NIX systems (especially Linux and BSD, later other systems such as AIX). My second problem is with the [:graph:] class. Filenames could be:

cat.txt
dog_and_cat.txt
Where is the cat?.png
my.cat.is.cute.txt.js.html
;;; ;;; ;;;.......321
áéúő _[a lot of whitespaces]_ óü^^^^^ö.jpg

As you can see, these are difficult to handle in a for loop. For example,

for i in *; do something; done

doesn't always cope with whitespace and unusual characters, especially across different operating systems.

The renaming convention is to rename every file to the form $FOOBAR.$EXTENSION, where $FOOBAR is some kind of hash, for example an md5sum. So in the for loop I have a line like

mv $FILE $(md5sum $FILE | sed 's/\ \ .\+//');

This renames the file to its own md5sum, but the extension disappears. I want to preserve the extensions, which are almost always of the form .[a-zA-Z0-9]{1,3}. Occasionally there are extensions like .tar.gz, which also need to be preserved (I could certainly list them in a variable, say MYEXTENSIONS='tar.gz tar.bz2 foo.bar').

My intuition tells me that the problem is solvable with well-parametrized default UNIX/shell commands, but it is extremely difficult for me now. I'm sure I'll learn a lot from the answers. I know I said the magical word portability, but the solution is preferred in bash, if I must specify the language.

Best Answer

Actually, for i in *; do something; done treats every file name correctly, except that file names that begin with a . are excluded from the wildcard matching. To match all files (except . and ..) portably, match * .[!.]* ..?* and skip any nonexistent file resulting from a non-matching pattern being left intact.
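
As a minimal sketch of that three-pattern loop, here is a helper that counts every entry in a directory, hidden ones included. The name count_entries is made up for illustration; the extra -L test is my own addition so that dangling symlinks are not skipped.

```shell
# count_entries DIR: print the number of entries in DIR, excluding . and ..
# (hypothetical helper name, for illustration only)
count_entries() (
  cd -- "$1" || exit 1
  n=0
  for f in * .[!.]* ..?*; do
    # A pattern that matched nothing is left as literal text; skip it.
    # -L also accepts dangling symlinks, which -e alone would reject.
    [ -e "$f" ] || [ -L "$f" ] || continue
    n=$((n + 1))
  done
  printf '%s\n' "$n"
)
```

The function body runs in a subshell (note the parentheses), so the cd does not affect the caller.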

If you experienced problems, it's probably because you didn't quote $i properly later on. Always put double quotes around variable substitutions and command substitutions: "$foo", "$(cmd)" unless you intend field splitting and globbing to happen.
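
To see the difference quoting makes, here is a small sketch. The helper count_args and the sample name are assumptions for illustration; the default IFS is assumed.

```shell
# count_args: print how many arguments it received (illustrative helper).
count_args() { printf '%s\n' "$#"; }

name='my cat photo.png'

unquoted=$(count_args $name)    # field splitting breaks the name apart
quoted=$(count_args "$name")    # the name arrives as a single argument

printf 'unquoted: %s, quoted: %s\n' "$unquoted" "$quoted"
```

With the default IFS this reports 3 arguments for the unquoted form and 1 for the quoted form. A name containing ? or * could expand further in the unquoted case, depending on which files exist.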

If you need to pass the file name to an external command (you don't, here), be careful that echo "$foo" does not always print $foo literally. A few shells perform backslash expansion, and a few values of $foo beginning with - will be treated as an option. The safe and POSIX-compliant way to print a string exactly is

printf '%s' "$foo"

or printf '%s\n' "$foo" to add a newline at the end. Another thing to watch out for is that command substitution removes trailing newlines; if you need to retain newlines, a possible trick is to append a non-newline character to the data, make sure the transformation retains this character, and finally truncate this character. For example:

mangled_file_name="$(printf '%sa' "$file_name" | tr -sc '[:alnum:]-+_.' '[_*]')"
mangled_file_name="${mangled_file_name%a}"
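
The same trick in isolation, as a sketch with a made-up file name that ends in a newline (the sentinel character x is arbitrary):

```shell
file_name='ends with newline
'

# Plain command substitution drops the trailing newline.
stripped=$(printf '%s' "$file_name")

# Appending a sentinel protects the newline; strip the sentinel afterwards.
kept=$(printf '%sx' "$file_name")
kept=${kept%x}

if [ "$kept" = "$file_name" ] && [ "$stripped" != "$file_name" ]; then
  printf 'sentinel preserved the trailing newline\n'
fi
```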

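To extract the md5sum of the file, avoid having the file name in the md5sum output, since that will make it hard to strip. Pass the data on md5sum's standard input instead. GNU md5sum still prints a "  -" placeholder after the hash where the file name would go, so keep only the first field of the output.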

Note that the md5sum command is not in POSIX. A few unix variants have md5 or nothing at all. cksum is POSIX but collision-prone.
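
One way to cope with that, as a sketch: a helper (the name content_hash and the order of preference are my assumptions) that uses whichever tool is installed and keeps only the first field of its output. All three tools omit the file name when they read from standard input.

```shell
# content_hash FILE: print a checksum of FILE's contents, without a file name.
# Hypothetical helper; tries md5sum, then md5, then POSIX cksum.
content_hash() {
  if command -v md5sum >/dev/null 2>&1; then
    _sum=$(md5sum <"$1") || return 1
  elif command -v md5 >/dev/null 2>&1; then
    _sum=$(md5 <"$1") || return 1
  else
    _sum=$(cksum <"$1") || return 1    # CRC only: collision-prone
  fi
  # Keep the first field: drops the "  -" of md5sum and the size of cksum.
  printf '%s\n' "${_sum%% *}"
}
```

Keep in mind that the cksum fallback produces a different kind of name (a decimal CRC) than the MD5 tools, so the resulting names are not comparable across systems.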

See Grabbing the extension in a file name on how to get the file's extension.
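
For the simple case, plain parameter expansion is enough. A sketch, with file_ext as a made-up helper name:

```shell
# file_ext NAME: print NAME's last extension including the leading dot,
# or an empty line when there is none. Hypothetical helper.
file_ext() {
  case $1 in
    ?*.*) printf '.%s\n' "${1##*.}" ;;   # at least one character before a dot
    *)    printf '\n' ;;                 # no dot, or only a leading dot
  esac
}

file_ext 'my.cat.is.cute.txt.js.html'    # prints .html
file_ext 'Where is the cat?.png'         # prints .png
file_ext '.bashrc'                       # prints an empty line
```

The ?*.* pattern requires a character before the dot, so a dot file such as .bashrc is treated as having no extension.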

Let's put it all together (untested). Everything here works under any POSIX shell; you could gain a little, but not much, from bash features.

for old_name in * .[!.]* ..?*; do
  if ! [ -e "$old_name" ]; then continue; fi   # unmatched pattern left intact
  hash=$(md5sum <"$old_name")
  hash=${hash%% *}                    # drop the "  -" that follows the hash
  case "$old_name" in
    *.*.gz|*.*.bz2)                   # double extension
      ext=".${old_name##*.}"          # last part, e.g. .gz
      tmp="${old_name%.*}"            # name without the last part
      ext=".${tmp##*.}$ext";;         # e.g. .tar.gz
    ?*.*) ext=".${old_name##*.}";;    # simple extension
    *) ext=;;                         # no extension
  esac
  mv -- "$old_name" "$hash$ext"
done

Note that I did not consider the case where there is already a target file by the specified name. In particular, if you have existing files whose name looks like your adopted convention but where the checksum part doesn't match the file's contents and instead matches that of some other file with the same extension, what happens will depend on the relative lexicographic order of the file names.
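
If you want to guard against that, one possible approach (a sketch, not part of the loop above; rename_no_clobber is a made-up name) is to refuse to overwrite an existing target:

```shell
# rename_no_clobber OLD NEW: rename OLD to NEW unless NEW already exists.
# Hypothetical helper. There is a window between the test and the mv,
# so this is not safe against concurrent writers in the same directory.
rename_no_clobber() {
  if [ "$1" = "$2" ]; then
    return 0                            # file already carries its target name
  fi
  if [ -e "$2" ] || [ -L "$2" ]; then
    printf 'skipping %s: %s already exists\n' "$1" "$2" >&2
    return 1
  fi
  mv -- "$1" "$2"
}
```

In the loop you would then call rename_no_clobber "$old_name" "$hash$ext" in place of the mv line and decide what to do with the files that were skipped.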
