Command-Line – Rename Files and Directories with French Characters

command line, regular expression, rename

I use the following command on Ubuntu with rename (installed with sudo apt-get install rename) to rename all files whose names contain characters other than those listed in the regex:

find . -execdir rename 's/[^A-Za-z0-9_.@+,#!?:&%~\(\)\[\]\/ \-]/?/g' * {} \;

This is working very well and all other characters are changed to ?. Now I want to include French characters like àèìòù and so on. So I added À-ÿ to my regex:

find . -execdir rename 's/[^A-Za-zÀ-ÿ0-9_.@+,#!?:&%~\(\)\[\]\/ \-]/?/g' * {} \;

But somehow the files are not getting renamed and they seem to be corrupted after running this command with À-ÿ because I can't delete them anymore.

What is the right way to include them in the rename regex?

Best Answer

Assuming those file names are encoded in UTF-8, use:

find . -depth -execdir rename -n '
  utf8::decode$_ or die "cannot decode $_\n";
  s{[^\w.\@+,#!?:&%~()\[\]/ -]}{?}gs;
  utf8::encode$_;
  ' {} +

(remove the -n when happy).

Beware that some BSD implementations of find do not prefix the file names with ./ with -execdir, so that command could fail if there are file names that start with -. With your variant of rename, you should be able to work around it by changing rename -n to rename -n -- (that doesn't work with all other perl rename variants).

In modern versions of perl, \w (for word character) matches any alphanumeric character (in any alphabetic script, not just Latin), the underscore and other connector punctuation characters, and Unicode marks (so, for instance, it includes the combining acute accent character that follows e in the decomposed form of é).

If you wanted to be more restrictive, instead of \w, you could use \p{latin}\p{mark}0-9_ to only include letters in the Latin script (and not Cyrillic, Greek...), the combining diacritics (though not limited to those typically used with Latin letters), and only the Hindu–Arabic decimal digits (and not other kinds of digits) and underscore (and not other connector punctuation characters).

If you don't use utf8::decode, perl will assume the characters are encoded in the iso8859-1 single-byte character set (where, for instance, 0xc3 0xa9, the UTF-8 encoding of the pre-composed form of é, is the two characters Ã and ©).

Alternatively, you can use zsh (which will decode characters as per the locale's encoding (see the output of locale charmap)):

autoload zmv # best in ~/.zshrc
zmv -n '(**/)(*)(#qD)' '$1${2//[^][:alnum:]_.@+,#!?:&%~()[\/ -]/?}'

Each byte from any sequence of bytes that don't form valid characters in your locale will also be turned into a ? (where rename above would die with a cannot decode error).

Its [[:alnum:]] uses your locale's alnum category so is unlikely to include other Unicode connector punctuation or marks characters.

In both perl and zsh (but often not in other tools), ranges like [a-zÀ-ÿ] are based on the code points of the characters. For instance, a, z, À and ÿ are \u0061, \u007A, \u00C0 and \u00FF, so that range would match abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ, that is, every character in those two ranges of code points. That includes non-alphabetic characters (× and ÷) and misses some characters of the Latin script, including ones used in French like œ. In perl, you'd also need to add use utf8 to be able to use the UTF-8 encoding of À and ÿ in the perl code.
