Command-Line – Rename Files and Directories with French Characters

command line, regular expression, rename

I use the following command on Ubuntu with rename (installed with sudo apt-get install rename) to replace, in all file names, every character that is not listed in the regex:

find . -execdir rename 's/[^A-Za-z0-9_.@+,#!?:&%~\(\)\[\]\/ \-]/?/g' * {} \;

This is working very well and all other characters are changed to ?. Now I want to include French characters like àèìòù and so on. So I added À-ÿ to my regex:

find . -execdir rename 's/[^A-Za-zÀ-ÿ0-9_.@+,#!?:&%~\(\)\[\]\/ \-]/?/g' * {} \;

But somehow the files are not getting renamed and they seem to be corrupted after running this command with À-ÿ because I can't delete them anymore.

What is the right way to include them in the rename regex?

Best Answer

Assuming those file names are encoded in UTF-8, use:

find . -depth -execdir rename -n '
  utf8::decode$_ or die "cannot decode $_\n";
  s{[^\w.\@+,#!?:&%~()\[\]/ -]}{?}gs;
  utf8::encode$_;
  ' {} +

(remove the -n when happy).

Beware that some BSD implementations of find do not prefix the file names with ./ with -execdir, so that command could fail if there are file names that start with -. With your variant of rename, you should be able to work around it by changing rename -n to rename -n -- (that doesn't work with all other perl rename variants).

In modern versions of perl, \w (for word character) matches any alphanumeric character (in any alphabetic script, not just Latin), the underscore and other connector punctuation characters, and Unicode marks (so, for instance, it includes the combining acute accent character that follows e in the decomposed form of é).

If you wanted to be more restrictive, instead of \w, you could use \p{latin}\p{mark}0-9_ to only include letters in the Latin script (and not Cyrillic, Greek...), the combining diacritics (though not limited to those typically used with Latin letters), and only the Hindu–Arabic decimal digits (and not other kinds of digits) and underscore (and not other connector punctuation characters).

If you don't use utf8::decode, perl will assume the characters are encoded in the iso8859-1 single-byte character set (where, for instance, 0xc3 0xa9, the UTF-8 encoding of the pre-composed form of é, is read as Ã©).
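The byte-level effect can be checked from the shell. This is a small sketch under the assumption that od and perl are available; hex_bytes and misread_as_latin1 are hypothetical helper names, not something the answer defines.

```shell
# Hypothetical helper: print the bytes of a string as hex digits.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Hypothetical helper: let perl read the bytes without decoding (each byte
# becomes one iso8859-1 character), then write those characters as UTF-8.
misread_as_latin1() {
  printf '%s' "$1" | perl -pe 'utf8::encode($_)'
}

e_acute=$(printf '\303\251')                        # pre-composed é in UTF-8
hex_bytes "$e_acute"; echo                          # c3a9
hex_bytes "$(misread_as_latin1 "$e_acute")"; echo   # c383c2a9, i.e. Ã followed by ©
```

The second result shows the two original bytes each turned into a separate two-byte character, which is what happens to é when the decode step is skipped.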

Alternatively, you can use zsh (which will decode characters as per the locale's encoding (see the output of locale charmap)):

autoload zmv # best in ~/.zshrc
zmv -n '(**/)(*)(#qD)' '$1${2//[^][:alnum:]_.@+,#!?:&%~()[\/ -]/?}'

Each byte from any sequence of bytes that don't form valid characters in your locale will also be turned into a ? (where rename above would die with a cannot decode error).

Its [[:alnum:]] uses your locale's alnum category so is unlikely to include other Unicode connector punctuation or marks characters.

In both perl and zsh (but often not in other tools), ranges like [a-zÀ-ÿ] are based on the code points of the characters. For instance, azÀÿ are \u0061\u007A\u00C0\u00FF, so that range would match abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ. That set includes non-alphabetic characters (× and ÷) and misses some characters of the Latin script that are used in French, like œ. In perl, you'd also need to add a use utf8 to be able to use the UTF-8 encoding of À and ÿ in the perl code.

Related Solutions

Shell – Bulk Rename Files with Special Characters

I guess you see this � invalid character because the name contains a byte sequence that isn't valid UTF-8. File names on typical unix filesystems (including yours) are byte strings, and it's up to applications to decide on what encoding to use. Nowadays, there is a trend to use UTF-8, but it's not universal, especially in locales that could never live with plain ASCII and have been using other encodings since before UTF-8 even existed.

Try LC_CTYPE=en_US.iso88591 ls to see if the file name makes sense in ISO-8859-1 (latin-1). If it doesn't, try other locales. Note that only the LC_CTYPE locale setting matters here.

In a UTF-8 locale, the following command will show you all files whose name is not valid UTF-8:

grep-invalid-utf8 () {
  perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print'
}
find | grep-invalid-utf8

You can check if they make more sense in another locale with recode or iconv:

find | grep-invalid-utf8 | recode latin1..utf8
find | grep-invalid-utf8 | iconv -f latin1 -t utf8

Once you've determined that a bunch of file names are in a certain encoding (e.g. latin1), one way to rename them is

find | grep-invalid-utf8 |
rename 'BEGIN {binmode STDIN, ":encoding(latin1)"; use Encode;}
        $_=encode("utf8", $_)'

This uses the perl rename command available on Debian and Ubuntu. You can pass it -n to show what it would be doing without actually renaming the files.

Recursively rename files and directories

With find:

find . -type f -exec sh -c 'SHELL COMMAND' {} \;

This invokes SHELL COMMAND on each found file in turn; the file name is "$0". Thus:

find . -type f -exec sh -c '
    mv "$0" "${0%/*}/$(printf "%s\n" "${0##*/}" | sha1sum | cut -d" " -f1)"
' {} \;

(Note the use of printf rather than echo, in case you have a file called -e or -n or a few other problematic cases that echo mangles.)
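The two parameter expansions in the mv command split each path into its directory and its last component. A minimal sketch of just that step, with hypothetical helper names and a made-up sample path:

```shell
# Hypothetical helpers, for illustration only.
dir_part()  { printf '%s\n' "${1%/*}"; }    # drop the shortest trailing /component
base_part() { printf '%s\n' "${1##*/}"; }   # drop the longest leading .../ prefix

path='./photos/2019/-n'
dir_part "$path"    # ./photos/2019
base_part "$path"   # -n
```

This assumes the path contains at least one /, which holds for everything find prints when started from the . directory.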

You can make this a little faster by invoking the shell in batches.

find . -type f -exec sh -c 'for x; do
      mv "$x" "${x%/*}/$(printf "%s\n" "${x##*/}" | sha1sum | cut -d" " -f1)";
    done' _ {} +

In zsh, there's an easy way to match all the files in the current directory and its subdirectories recursively. The . glob qualifier restricts the matches to regular files, and D includes dot files.

for x in **/*(.D); do mv …; done

In bash ≥4, you can run shopt -s globstar and use **/* to match all files in the current directory and its subdirectories recursively. You'll need to filter regular files in the loop.

shopt -s globstar; GLOBIGNORE=".:.."
for x in **/*; do if [[ -f $x ]]; then mv …; fi; done
