In a file system where filenames are in UTF-8, I have a file with a faulty name; it is displayed as: D�sinstaller, actual name according to zsh: `D$'\351'sinstaller`, Latin1 for Désinstaller, itself a French barbarism for "uninstall." Zsh would not match it with `[[ $file =~ '^.*$' ]]` but would match it with a globbing `*`; this is the behavior I expect.

Now I still expect to find it when running `find . -name '*'`; as a matter of fact, I would never expect a filename to fail this test. However, with `LANG=en_US.utf8`, the file does not show up, and I have to set `LANG=C` (or `en_US`, or `''`) for it to work.
Question: What implementation lies behind this behavior, and how could I have predicted that outcome?

Info: Arch Linux 3.14.37-1-lts, find (GNU findutils) 4.4.2
Best Answer
That's a really nice catch. From a quick look at the source code for GNU find, I would say this boils down to how `fnmatch` behaves on invalid byte sequences (`pred_name_common` in `pred.c`):

This code tests the return value of `fnmatch` for equality with 0, but does not check for errors; this results in any errors being reported as "doesn't match".

It has been suggested, many years ago, to change the behavior of this libc function to always return true on the `*` pattern, even on broken file names, but from what I can tell the idea must have been rejected (see the thread starting at https://sourceware.org/ml/libc-hacker/2002-11/msg00071.html):

As mentioned by Stéphane Chazelas in a comment, and also in the same 2002 thread, this is inconsistent with the glob expansion performed by shells, which do not choke on invalid characters. Perhaps even more puzzling is the fact that reversing the test will match only those files that have broken names (create files in bash with `touch $'D\351marrer' $'Touch\303\251' $'\346\227\245\346\234\254\350\252\236'`):

So, to answer your question, you could have predicted this by knowing the behavior of your `fnmatch` in this case, and knowing how `find` handles this function's return value; you probably could not have found out solely by reading the documentation.
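
To probe how your own `fnmatch` treats such a name, a small test program may help. The following is only a sketch, not code from findutils: the `match_result` enum, the `classify_match` helper and the `FNMATCH_DEMO_MAIN` macro are names made up for this illustration. It keeps the three outcomes that POSIX allows for `fnmatch` (0, `FNM_NOMATCH`, or some other value on error) apart, instead of collapsing them into a boolean.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical three-way result; these names are not taken from findutils. */
enum match_result { MATCH_YES, MATCH_NO, MATCH_ERROR };

/* Classify fnmatch()'s return value instead of reducing it to "== 0". */
enum match_result classify_match(const char *pattern, const char *name)
{
    assert(pattern != NULL && name != NULL);

    int rc = fnmatch(pattern, name, 0);
    if (rc == 0)
        return MATCH_YES;
    if (rc == FNM_NOMATCH)
        return MATCH_NO;
    return MATCH_ERROR; /* any other value: fnmatch reported a failure */
}

#ifdef FNMATCH_DEMO_MAIN
int main(void)
{
    static const char *labels[] = { "match", "no match", "error" };
    const char *name = "D\351sinstaller"; /* Latin-1 byte 0xE9, invalid as UTF-8 */

    /* Use the locale from the environment (LANG, LC_ALL, ...), as find does. */
    setlocale(LC_ALL, "");
    printf("'*' against the Latin-1 name: %s\n",
           labels[classify_match("*", name)]);
    return 0;
}
#endif
```

Compiling with something like `cc -DFNMATCH_DEMO_MAIN probe.c -o probe` (the file name is arbitrary) and running it once with `LANG=en_US.utf8` and once with `LANG=C` should show whether your libc reports the invalid byte sequence as a match, a non-match, or an error. The result depends on the libc version and on which locales are installed, so treat the output as a description of your system rather than a general rule.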