I use the following command on Ubuntu with `rename` (installed with `sudo apt-get install rename`) to rename all files whose names contain characters other than the ones listed in the regex:
find . -execdir rename 's/[^A-Za-z0-9_.@+,#!?:&%~\(\)\[\]\/ \-]/?/g' * {} \;
This is working very well and all other characters are changed to `?`. Now I want to include French characters like `àèìòù` and so on. So I added `À-ÿ` to my regex:
find . -execdir rename 's/[^A-Za-zÀ-ÿ0-9_.@+,#!?:&%~\(\)\[\]\/ \-]/?/g' * {} \;
But somehow the files are not getting renamed, and they seem to be corrupted after running this command with `À-ÿ`, because I can't delete them anymore.
What is the right way to include them in the rename regex?
Best Answer
Assuming those file names are encoded in UTF-8, use:
(remove the `-n` when happy).

Beware that some BSD implementations of `find` do not prefix the file names with `./` with `-execdir`, so that command could fail if there are file names that start with `-`. With your variant of `rename`, you should be able to work around it by changing `rename -n` to `rename -n --` (that doesn't work with all other perl `rename` variants).

In modern versions of
`perl`, `\w` (for word character) matches any alphanumeric character (in any alphabetic script, not just Latin), the underscore and other connector punctuation characters, and Unicode marks (so, for instance, it includes the combining acute accent character that follows `e` in the decomposed form of `é`).

If you wanted to be more restrictive, instead of `\w`, you could use `\p{latin}\p{mark}0-9_` to only include letters in the Latin script (and not Cyrillic, Greek...), the combining diacritics (though not limited to those typically used with Latin letters), and only the Hindu–Arabic decimal digits (and not other kinds of digits) and underscore (and not other connector punctuation characters).

If you don't use
`utf8::decode`, `perl` will assume the characters are encoded in the iso8859-1 single-byte character set (where, for instance, `0xc3 0xa9` (the UTF-8 encoding of the pre-composed form of `é`) is `Ã©`).

Alternatively, you can use
`zsh` (which will decode characters as per the locale's encoding (see the output of `locale charmap`)):

Each byte from any sequence of bytes that don't form valid characters in your locale will also be turned into a `?` (where `rename` above would die with a `cannot decode` error).

Its `[[:alnum:]]` uses your locale's `alnum` category so is unlikely to include other Unicode connector punctuation or mark characters.

In both
`perl` and `zsh` (but often not in other tools), ranges like `[a-zÀ-ÿ]` are based on the code points of the characters. For instance, `azÀÿ` are `\u0061\u007A\u00C0\u00FF`, so that range would match the `abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ` characters in that range of code points (which includes non-alphabetic characters and not all characters in the Latin script or used in the French language, like `œ`). In `perl`, you'd also need to add a `use utf8` to be able to use the UTF-8 encoding of `À` and `ÿ` in the perl code.