I am trying to rename files that include the character "à".
I do the following:
rename -v 's/à/a/g' *
But it reports all the files as unchanged, and verbose mode shows the same thing.
I tried escaping the character with a backslash (\), but with no luck.
How can I make the regex match this type of character?
EDIT: The output of perl -V:
Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
Platform:
osname=darwin, osvers=16.0, archname=darwin-thread-multi-2level
uname='darwin osx320.apple.com 16.0 darwin kernel version 15.0.0: wed jun 22 17:57:08 pdt 2016; root:xnu-3247.1.106.2.9~1development_x86_64 x86_64 '
config_args='-ds -e -Dprefix=/usr -Dccflags=-g -pipe -Dldflags= -Dman3ext=3pm -Duseithreads -Duseshrplib -Dinc_version_list=none -Dcc=cc'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-arch x86_64 -arch i386 -g -pipe -fno-common -DPERL_DARWIN -fno-strict-aliasing -fstack-protector',
optimize='-Os',
cppflags='-g -pipe -fno-common -DPERL_DARWIN -fno-strict-aliasing -fstack-protector'
ccversion='', gccversion='4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='cc -mmacosx-version-min=10.12.5', ldflags ='-arch x86_64 -arch i386 -fstack-protector'
libpth=/usr/lib /usr/local/lib
libs=
perllibs=
libc=, so=dylib, useshrplib=true, libperl=libperl.dylib
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags='-arch x86_64 -arch i386 -bundle -undefined dynamic_lookup -fstack-protector'
Characteristics of this binary (from libperl):
Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
PERL_DONT_CREATE_GVSV
PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
USE_REENTRANT_API
Locally applied patches:
/Library/Perl/Updates/<version> comes before system perl directories
installprivlib and installarchlib points to the Updates directory
Built under darwin
Compiled at Feb 6 2017 22:16:22
@INC:
/Library/Perl/5.18/darwin-thread-multi-2level
/Library/Perl/5.18
/Network/Library/Perl/5.18/darwin-thread-multi-2level
/Network/Library/Perl/5.18
/Library/Perl/Updates/5.18.2
/System/Library/Perl/5.18/darwin-thread-multi-2level
/System/Library/Perl/5.18
/System/Library/Perl/Extras/5.18/darwin-thread-multi-2level
/System/Library/Perl/Extras/5.18
.
EDIT 2: Output of locale:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
SOLUTION
Here is, in a nutshell, what worked. All three solutions did the job:
- rename -nv $'s/a\xcc\x80/a/g' *
- PERL_UNICODE=AS rename -n 's/\pM//g' ./* (see the explanations in the chosen answer)
- Switching to zsh instead of the default shell of macOS (bash). My original command then worked, without any need to spell out combining characters such as a\u300: rename -v 's/à/a/g' *
If you're not satisfied with any of these solutions, please look at the chosen answer for more useful tips.
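For systems where the Perl rename utility is not installed, the first solution can be approximated with plain sh, sed and mv. This is only a sketch under a few assumptions: the helper name strip_grave_from_a is made up for illustration, it handles only a followed by U+0300 (bytes 0xcc 0x80), and it skips a file rather than overwrite an existing one.

```sh
#!/bin/sh
# Sketch only: mimics  rename $'s/a\xcc\x80/a/g' *  with sed and mv.
# strip_grave_from_a is a hypothetical helper name, not part of any tool.

# "a" followed by U+0300 (combining grave accent): UTF-8 bytes 61 cc 80,
# written as octal escapes so that any POSIX printf understands them.
decomposed=$(printf 'a\314\200')

# Rename every entry of directory $1 whose name contains the decomposed
# sequence, replacing each occurrence with a plain "a".
strip_grave_from_a() {
    for old in "$1"/*; do
        [ -e "$old" ] || continue            # the glob matched nothing
        base=${old##*/}
        # LC_ALL=C makes sed work on raw bytes whatever the locale is.
        new=$(printf '%s\n' "$base" | LC_ALL=C sed "s/${decomposed}/a/g")
        [ "$new" = "$base" ] && continue     # nothing to change
        if [ -e "$1/$new" ]; then
            echo "skipping $base: $new already exists" >&2
            continue
        fi
        mv -- "$old" "$1/$new"
    done
}
```

Calling strip_grave_from_a . would process the current directory. Trying it on a copy of the files first is probably wise, since unlike rename -n it has no dry-run mode.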
Best Answer
On macOS, and with the HFS+ file system at least, accented characters are encoded in their decomposed form, so à is encoded as a\u300 (a followed by the combining grave accent character) even if you created the file with touch $'\ue0' (the precomposed form: a stand-alone a with grave accent). This causes all sorts of bugs (and is the subject of one of Linus Torvalds' famous rants), much like its pseudo case insensitivity does.
You'll notice that if you do:
to list the file names made of one character, it returns nothing, while:
or
does return that à (actually a\u300). And:
So you'd need:
(assuming zsh or a compatible shell). Or, specifying the UTF-8 encoding of that U+0300 character (0xcc 0x80) by hand, for shells that support the ksh93 $'...' quotes but not zsh's $'\u300' (like the ancient version of bash found on macOS):
rename -nv $'s/a\xcc\x80/a/g' *
Or let perl interpret those \xcc\x80 sequences directly:
Or the Unicode character:
Or remove all combining characters with:
PERL_UNICODE=AS rename -n 's/\pM//g' ./*
There, we're telling perl to consider that Arguments and Stdio streams are encoded in UTF-8 (see perldoc perlrun for a description of the $PERL_UNICODE env var, equivalent to the -C option) and to remove all the characters that have the Mark Unicode property (\pM is short for \p{Mark} or \p{Combining_Mark}, see perldoc perluniprops for details).
Note that you should be able to list that file (in zsh) both with:
and:
(and $'A\u300' and possibly $'\uc0' for À, as it's meant to be case insensitive), but:
and in shells other than zsh:
won't match it, because the shell lists the content of the current directory and applies the pattern against each file name, and the file name is encoded as a\u300, which won't match.
On zsh however, and on macOS only, the shell internally converts those letters with combining accents to their precomposed form upon readdir(), as if passing them through iconv -f UTF-8-MAC -t UTF-8. Its own internal zreaddir() wrapper around readdir() returns U+00E0 instead of a U+0300, which explains why echo *à* works there (and not echo *a*) and not elsewhere. The change was introduced in June 2014. See the discussion on the zsh mailing list for more details.
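The effect of that conversion on pattern matching can be sketched in plain sh. This is a toy model, not zsh's actual code: glob_matches and recompose_a_grave are hypothetical helper names, and the "normalisation" only recomposes the single sequence a + U+0300, whereas the real conversion covers every decomposed character.

```sh
#!/bin/sh
# Both spellings of "à", built from explicit UTF-8 bytes (octal escapes)
# so the example does not depend on how this text itself is encoded.
precomposed=$(printf '\303\240')   # U+00E0, bytes c3 a0
decomposed=$(printf 'a\314\200')   # "a" + U+0300, bytes 61 cc 80

# glob_matches PATTERN NAME: succeed if NAME matches shell pattern PATTERN.
glob_matches() {
    case $2 in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

# Toy stand-in for the conversion done upon readdir():
# recomposes this one sequence only.
recompose_a_grave() {
    printf '%s\n' "$1" | LC_ALL=C sed "s/${decomposed}/${precomposed}/g"
}

raw_name=$decomposed                        # the name as stored on disk
zsh_name=$(recompose_a_grave "$raw_name")   # the name after conversion

for name in "$raw_name" "$zsh_name"; do
    for pattern in "*${precomposed}*" '*a*'; do
        if glob_matches "$pattern" "$name"; then
            result='match'
        else
            result='no match'
        fi
        printf '%s against %s: %s\n' "$pattern" "$name" "$result"
    done
done
```

Under these assumptions the precomposed pattern only matches the recomposed name, and *a* only matches the raw one, which mirrors the echo *à* versus echo *a* behaviour described above.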
The core of the problem is the discrepancy between the encoding used on user input and the one used to store (and list) file names in the file system. The problem is a lot worse in Korean where virtually every character has a precomposed and decomposed form, which explains why the zsh issue was raised by a Korean person initially.
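To make the Korean case concrete, here is a small sketch comparing the two forms of one Hangul syllable. The syllable U+D55C and its decomposition into the jamo U+1112 U+1161 U+11AB are an example chosen for illustration, not one taken from the mailing list discussion, and byte_len is a hypothetical helper name.

```sh
#!/bin/sh
# One Hangul syllable, once precomposed and once as three conjoining jamo.
# Bytes are given as octal escapes so that any POSIX printf understands them.
precomposed=$(printf '\355\225\234')                          # U+D55C
decomposed=$(printf '\341\204\222\341\205\241\341\206\253')   # U+1112 U+1161 U+11AB

# byte_len STRING: number of bytes in STRING.
byte_len() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

echo "precomposed: $(byte_len "$precomposed") bytes"
echo "decomposed:  $(byte_len "$decomposed") bytes"

# A byte-wise comparison, which is what grep or find effectively do.
if [ "$precomposed" = "$decomposed" ]; then
    echo "byte-identical"
else
    echo "different byte strings for the same visible syllable"
fi
```

A user typing the precomposed syllable would therefore be searching for 3 bytes that never occur in a file name stored as the 9-byte decomposed sequence.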
So zsh basically fixes Apple's poor choice of decomposed form in the file system so that its completion and globs can be used, but unfortunately that only applies to zsh: ls | grep à or find . -name '*à*' still won't work.
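A minimal sketch of that last point, assuming a scratch directory from mktemp and a made-up helper find_by_name: the file is created with an explicitly decomposed name, so the mismatch can be reproduced even on a file system that stores names exactly as given.

```sh
#!/bin/sh
precomposed=$(printf '\303\240')   # U+00E0, what is usually typed
decomposed=$(printf 'a\314\200')   # "a" + U+0300, what is stored on disk

dir=$(mktemp -d)
: > "$dir/voil${decomposed}.txt"   # create the name in decomposed form

# find_by_name DIR FRAGMENT: list entries of DIR whose name contains FRAGMENT.
# LC_ALL=C makes find compare raw bytes whatever the locale is.
find_by_name() {
    LC_ALL=C find "$1" -name "*$2*"
}

typed=$(find_by_name "$dir" "$precomposed")    # search as typed by the user
stored=$(find_by_name "$dir" "$decomposed")    # search for the stored form

echo "precomposed search found: ${typed:-nothing}"
echo "decomposed search found:  ${stored:-nothing}"

rm -rf "$dir"
```

Searching for the decomposed form, as the rename commands above do, is the workaround outside zsh.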