Linux – Using find or grep to locate filenames with accented characters from a different encoding system (Windows to Linux)

bash find grep linux

I tried adding a late follow-up to a similar question on Stack Overflow (Find Non-UTF8 Filenames on Linux File System) to elicit further replies, with no luck so far, so here goes again…

I have the same problem as the OP in the link above, and convmv is a great tool for fixing one's own filesystem. My question is therefore academic, but I find it unsatisfactory (in fact I can't believe) that 'find' is not able to find non-ASCII characters.

Does anyone know what combination of options will find filenames that contain non-standard characters on what seems to be a Unicode filesystem? In my case the characters seem to be 8-bit extended ASCII rather than Unicode: the files come from a Windows machine (ISO-8859-1) and I regularly need to fetch them. I'd love to see how find and/or grep can do the same as convmv.

Sample files:

> ls
Abc�def ÉÈéèáà-rest everest éverest

> ls -b
Abc\251def  ÉÈéèáà-rest  everest  éverest

The first file comes from Windows (or can be simulated with touch $(printf "Abc\xA9def")).
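For anyone wanting to reproduce the listing, here is a minimal sketch of one way to build a comparable scratch directory. The variable names (`sample_dir`, `file_count`) are made up for illustration, the ÉÈéèáà-rest file is left out for brevity, and it assumes a Linux filesystem that accepts arbitrary bytes in names plus GNU ls for the -b option.

```shell
#!/bin/sh
# Scratch directory, so nothing in the current directory is touched.
sample_dir=$(mktemp -d)

# Octal escapes are the portable way to spell raw bytes in a printf format;
# \251 is the single byte 0xA9 carried by the Windows-origin name.
touch "$sample_dir/$(printf 'Abc\251def')"

# The other two names are plain ASCII and ordinary UTF-8;
# \303\251 is the two-byte UTF-8 encoding of "é".
touch "$sample_dir/everest" "$sample_dir/$(printf '\303\251verest')"

# -b (GNU ls) prints non-graphic bytes as backslash-octal escapes,
# which makes the stray byte visible.
ls -b "$sample_dir"

# Count the entries to confirm all three names were created.
file_count=$(ls "$sample_dir" | wc -l)
```

Octal escapes are used instead of \xA9 because hexadecimal escapes in a printf format are a bash extension and may not work in other shells.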

> find . -regex '.*[^a-zA-Z./].*'
./ÉÈéèáà-rest

> ls | egrep '[^a-zA-Z]'
ÉÈéèáà-rest

This misses almost all of them (only the hyphen made that one file match, as coloured grep output shows). Whatever is happening here is not what I would expect: neither find nor grep treats an accented letter as falling outside the given range [^a-zA-Z./].

> find . -regex '.*é.*'
./éverest
./ÉÈéèáà-rest

> ls | egrep 'é'
ÉÈéèáà-rest
éverest

> ls | egrep '[é]'
ÉÈéèáà-rest
éverest

> find . -regex '.*[é].*'
./éverest
./ÉÈéèáà-rest

Bizarrely, both are able to pick up an accented letter when it is given explicitly (including inside a bracket expression). Every find or grep attempt with \xA9, \0251 or \o251 fails (no match).

> ls | fgrep e
Abc�def
ÉÈéèáà-rest
everest
éverest

Looking for a non-controversial character shows all files with grep, as I would have expected.

> find . -regex '.*e.*'
./éverest
./ÉÈéèáà-rest
./everest

> find . -name '*e*'
./éverest
./ÉÈéèáà-rest
./everest

find, however, is far pickier: even when searching for a plain character, it seems to skip filenames containing characters that are not valid in the filesystem's name encoding.

As far as I am concerned if the file is in the filesystem, then find should find it, right? But maybe there's a feature I don't know about?

Any insights would be very much appreciated.

Best Answer

The GNU tools appear to have code that causes accented letters to be treated like their base letters when matching a regex character class, if supported by the character encoding. This is intended as a "do what I mean" sort of feature to make writing regexes easier, but in this case it's getting in your way.

Try the following modification to your "find" command line:

LANG=C find . -regex '.*[^a-zA-Z./].*'

This sets the LANG environment variable only for that one "find" command. Since the "C" locale supports only ASCII, the accented letters will no longer be treated as their base letters, and so will be matched properly by your regex.
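As a rough illustration, the sketch below applies the same idea to both find and grep on a scratch copy of the sample names, and adds one possible way to single out the names that are not valid UTF-8 (roughly the part of the job convmv does). It is a sketch under assumptions, not a definitive recipe: the variable names are invented for the example, LC_ALL=C is used in place of LANG=C so the setting wins even if other LC_* variables are set, and the iconv round-trip is a swapped-in technique that the answer above does not mention.

```shell
#!/bin/sh
# Scratch directory holding the question's sample names (built with octal
# escapes so this script itself stays pure ASCII).
demo_dir=$(mktemp -d)
latin1_name=$(printf 'Abc\251def')      # lone byte 0xA9: not valid UTF-8
utf8_name=$(printf '\303\251verest')    # "éverest" encoded as UTF-8
touch "$demo_dir/$latin1_name" "$demo_dir/$utf8_name" "$demo_dir/everest"

# 1. find in the C locale: every byte is one character, so any byte
#    outside a-zA-Z./ is matched by the negated bracket expression.
find_hits=$(cd "$demo_dir" && LC_ALL=C find . -regex '.*[^a-zA-Z./].*' | LC_ALL=C sort)

# 2. The grep equivalent, run on the bare names.
grep_hits=$(ls "$demo_dir" | LC_ALL=C grep '[^a-zA-Z]' | LC_ALL=C sort)

# 3. Keep only names that are not valid UTF-8: iconv exits non-zero
#    when its input cannot be decoded as UTF-8.
not_utf8=$(
  for f in "$demo_dir"/*; do
    name=${f##*/}
    printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 ||
      printf '%s\n' "$name"
  done
)

printf 'find:\n%s\ngrep:\n%s\nnot UTF-8:\n%s\n' "$find_hits" "$grep_hits" "$not_utf8"
```

Steps 1 and 2 report every name with a non-ASCII byte, whether or not it is valid UTF-8, while step 3 narrows that down to names that probably came from a different encoding.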
