macOS – How to Rename Filenames with Accents


I am trying to rename files that include the character "à".

I run the following:

rename -v 's/à/a/g' *

But it shows all the files as unchanged. Verbose mode shows the same thing.

I tried to escape with \ but with no luck.

How can I make the regex match this type of character?

EDIT

The output of perl -V :

Summary of my perl5 (revision 5 version 18 subversion 2) configuration:

  Platform:
    osname=darwin, osvers=16.0, archname=darwin-thread-multi-2level
    uname='darwin osx320.apple.com 16.0 darwin kernel version 15.0.0: wed jun 22 17:57:08 pdt 2016; root:xnu-3247.1.106.2.9~1development_x86_64 x86_64 '
    config_args='-ds -e -Dprefix=/usr -Dccflags=-g  -pipe  -Dldflags= -Dman3ext=3pm -Duseithreads -Duseshrplib -Dinc_version_list=none -Dcc=cc'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-arch x86_64 -arch i386 -g -pipe -fno-common -DPERL_DARWIN -fno-strict-aliasing -fstack-protector',
    optimize='-Os',
    cppflags='-g -pipe -fno-common -DPERL_DARWIN -fno-strict-aliasing -fstack-protector'
    ccversion='', gccversion='4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc -mmacosx-version-min=10.12.5', ldflags ='-arch x86_64 -arch i386 -fstack-protector'
    libpth=/usr/lib /usr/local/lib
    libs= 
    perllibs=
    libc=, so=dylib, useshrplib=true, libperl=libperl.dylib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-arch x86_64 -arch i386 -bundle -undefined dynamic_lookup -fstack-protector'


Characteristics of this binary (from libperl): 
  Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
                        PERL_DONT_CREATE_GVSV
                        PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
                        PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
                        PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
                        USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
                        USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
                        USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
                        USE_REENTRANT_API
  Locally applied patches:
    /Library/Perl/Updates/<version> comes before system perl directories
    installprivlib and installarchlib points to the Updates directory
  Built under darwin
  Compiled at Feb  6 2017 22:16:22
  @INC:
    /Library/Perl/5.18/darwin-thread-multi-2level
    /Library/Perl/5.18
    /Network/Library/Perl/5.18/darwin-thread-multi-2level
    /Network/Library/Perl/5.18
    /Library/Perl/Updates/5.18.2
    /System/Library/Perl/5.18/darwin-thread-multi-2level
    /System/Library/Perl/5.18
    /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level
    /System/Library/Perl/Extras/5.18
    .

EDIT 2 :

Output of locale :

LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

SOLUTION

Here's what worked, in a nutshell. All three solutions did the job:

  1. rename -nv $'s/a\xcc\x80/a/g' *
  2. PERL_UNICODE=AS rename -n 's/\pM//g' ./* (see the explanation in the chosen answer)
  3. Switching to zsh instead of bash, the default shell on macOS. My original command then worked, with no need to spell out combining characters such as a\u300: rename -v 's/à/a/g' *

If you're not satisfied with any of these solutions, please look at the chosen answer for more useful tips.

Best Answer

On macOS, at least with the HFS+ file system, accented characters in file names are stored in their decomposed form. So à is stored as a\u300 (a followed by the combining grave accent), even if you created the file with touch $'\ue0' (the precomposed form, a stand-alone a with grave accent). This causes all sorts of bugs, as does the file system's pseudo-case-insensitivity, and was the subject of one of Linus Torvalds's famous rants.
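To make the two forms concrete, here is a small sketch that prints the UTF-8 bytes of each. The helper name hex_bytes is made up for this illustration; it only relies on printf and od, so it should behave the same in bash, zsh, or a POSIX sh.

```shell
# Precomposed form: U+00E0, encoded in UTF-8 as the bytes c3 a0.
precomposed=$(printf '\303\240')
# Decomposed form: plain "a" (61) followed by U+0300 (cc 80).
decomposed=$(printf 'a\314\200')

# hex_bytes STRING: print the bytes of STRING as one run of hex digits.
# (hex_bytes is a hypothetical helper, not a standard command.)
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

hex_bytes "$precomposed"   # c3a0
hex_bytes "$decomposed"    # 61cc80
```

Both strings render as à in a terminal, but they are different byte sequences, which is why a pattern typed in one form does not match a name stored in the other.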

You'll notice that if you do:

touch à; echo ?

to list the file names made of one character, it returns nothing while:

echo ??

or

echo *a*

does return that à (actually a followed by U+0300). And:

$ echo ?? | uconv -x name
\N{LATIN SMALL LETTER A}\N{COMBINING GRAVE ACCENT}\N{<control-000A>}

So you'd need:

rename $'s/a\u300/a/g' ./*

(assuming zsh or a compatible shell). Or, specifying the UTF-8 encoding of that U+0300 character (0xcc 0x80) by hand, for shells that support the ksh93-style $'...' quotes but not zsh's $'\u300' (like the ancient version of bash found on macOS):

rename $'s/a\xcc\x80/a/g' ./*

Or let perl interpret those \xcc\x80 sequences directly:

rename 's/a\xcc\x80/a/g' ./*

Or match the Unicode character itself:

PERL_UNICODE=AS rename 's/\x{300}//' ./*

Or remove all combining characters with:

PERL_UNICODE=AS rename -n 's/\pM//g' ./*

There, we're telling perl to treat Arguments and Stdio streams as UTF-8 encoded (see perldoc perlrun for a description of the $PERL_UNICODE environment variable, which is equivalent to the -C option) and to remove all characters that have the Mark Unicode property (\pM is short for \p{Mark} or \p{Combining_Mark}; see perldoc perluniprops for details).

Note that you should be able to list that file (in zsh) both with:

ls -d $'a\u300'

and:

ls -d $'\ue0'

(and $'A\u300' and possibly $'\uc0' for À, as it's meant to be case insensitive), but:

ls -d *A*

and in shells other than zsh:

ls -d *$'\ue0'*
ls -d *$'\xc3\xa0'*

won't match it, because the shell lists the contents of the current directory and applies the pattern to each file name, and the file name is encoded as a\u300, which won't match.
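The matching step can be reproduced without any file at all, using the same pattern matching that globs use. The sketch below assumes bash or a POSIX sh (in zsh, the unquoted $2 would not be treated as a pattern); glob_match is a hypothetical helper name.

```shell
# Name as the file system returns it: "a" followed by U+0300.
stored=$(printf 'a\314\200')
# What typing à usually produces: the precomposed U+00E0.
typed=$(printf '\303\240')

# glob_match NAME PATTERN: succeed if NAME matches the shell pattern PATTERN.
# (glob_match is a made-up helper for illustration.)
glob_match() {
    case $1 in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

if glob_match "$stored" "*$typed*"; then
    echo "precomposed pattern matches"
else
    echo "precomposed pattern does not match"
fi

if glob_match "$stored" '*a*'; then
    echo "plain a matches"
else
    echo "plain a does not match"
fi
```

The shell compares the pattern with the stored name as is and performs no Unicode normalization, so only the pattern written in the stored (decomposed) form, or one that matches the bare a, can succeed.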

On zsh however, and on macOS only, the shell internally converts letters with combining accents to their precomposed form upon readdir(), as if passing them through iconv -f UTF-8-MAC -t UTF-8. Its internal zreaddir() wrapper around readdir() returns U+00E0 instead of a followed by U+0300, which explains why echo *à* works there (and echo *a* does not), unlike in other shells.

The change was introduced in June 2014. See the discussion on the zsh mailing list for more details.

The core of the problem is the discrepancy between the encoding used on user input and the one used to store (and list) file names in the file system. The problem is a lot worse in Korean where virtually every character has a precomposed and decomposed form, which explains why the zsh issue was raised by a Korean person initially.

So zsh basically works around Apple's poor choice of decomposed form in the file system so that its completion and globs can be used. Unfortunately, that only applies to zsh: ls | grep à or find . -name '*à*' still won't work.
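One possible workaround for tools like grep is to write the search pattern in the decomposed form yourself. The sketch below simulates a directory listing with printf instead of calling ls, so it does not depend on the file system; count_matches is a hypothetical helper name.

```shell
# Simulated "ls" output: one name stored in decomposed form, one plain name.
listing=$(printf 'voila\314\200.txt\nplain.txt')

# count_matches PATTERN: count the lines of the listing containing PATTERN.
# (count_matches is a made-up helper; -F makes grep match a fixed string.)
count_matches() {
    printf '%s\n' "$listing" | grep -c -F -- "$1"
}

count_matches "$(printf '\303\240')"    # precomposed à
count_matches "$(printf 'a\314\200')"   # decomposed a + U+0300
```

The first call finds nothing and the second finds the file, for the same reason the globs above behave differently: the comparison is done on the stored byte sequence.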
