Shell – find(1): how is the star wildcard implemented for it to fail on some filenames

character encoding, filenames, find, shell, wildcards

In a file system where filenames are in UTF-8, I have a file with a faulty name. It is displayed as D�sinstaller; its actual name according to zsh is D$'\351'sinstaller, which is Latin-1 for Désinstaller (itself a French barbarism for "uninstall"). Zsh does not match it with [[ $file =~ '^.*$' ]] but does match it with a glob *, which is the behavior I expect.

I would likewise expect to find it when running find . -name '*'. As a matter of fact, I would never expect a filename to fail this test. However, with LANG=en_US.utf8 the file does not show up, and I have to set LANG=C (or en_US, or '') for it to work.

Question: What implementation is behind this, and how could I have predicted the outcome?

Info: Arch Linux 3.14.37-1-lts, find (GNU findutils) 4.4.2

Best Answer

That's a really nice catch. From a quick look at the source code for GNU find, I would say this boils down to how fnmatch behaves on invalid byte sequences (pred_name_common in pred.c):

b = fnmatch (str, base, flags) == 0;
(...)
return b;

This code tests the return value of fnmatch for equality with 0, but does not check for errors; this results in any errors being reported as "doesn't match".
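To make the distinction concrete, here is a minimal sketch of what a three-way check could look like. It relies only on the POSIX contract for fnmatch (0 for a match, FNM_NOMATCH for no match, some other non-zero value on error). The names name_result, classify_fnmatch and match_name are hypothetical and do not come from findutils.

```c
#include <fnmatch.h>

/* Hypothetical three-way result type; not part of findutils or libc. */
typedef enum { NAME_MATCH, NAME_NOMATCH, NAME_ERROR } name_result;

/* Classify an fnmatch() return value instead of collapsing it to a boolean.
   POSIX: 0 means match, FNM_NOMATCH means no match, and any other
   non-zero value means an error occurred. */
name_result classify_fnmatch(int rc)
{
    if (rc == 0)
        return NAME_MATCH;
    if (rc == FNM_NOMATCH)
        return NAME_NOMATCH;
    return NAME_ERROR;
}

/* Match a file name against a pattern while keeping errors distinct
   from ordinary mismatches. */
name_result match_name(const char *pattern, const char *name)
{
    return classify_fnmatch(fnmatch(pattern, name, 0));
}
```

With a check like this, a caller could choose to warn about a name it cannot decode rather than silently treating it as a non-match. Whether fnmatch actually reports an error for a given name depends on the libc version and the current locale.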

It was suggested many years ago to change the behavior of this libc function so that the * pattern always matches, even on broken file names, but from what I can tell the idea was rejected (see the thread starting at https://sourceware.org/ml/libc-hacker/2002-11/msg00071.html):

When fnmatch detects an invalid multibyte character it should fall back to single byte matching, so that "*" has a chance to match such a string.

And why is this better or more correct? Is there existing practice?

As mentioned by Stéphane Chazelas in a comment, and also in the same 2002 thread, this behavior of fnmatch is inconsistent with the glob expansion performed by shells, which do not choke on invalid characters. Perhaps even more puzzling is the fact that negating the test matches only those files that have broken names (create files in bash with touch $'D\351marrer' $'Touch\303\251' $'\346\227\245\346\234\254\350\252\236'):

$ find -name '*'
.
./Touché
./日本語

$ find -not -name '*'
./D?marrer

So, to answer your question, you could have predicted this by knowing the behavior of your fnmatch in this case, and knowing how find handles this function's return value; you probably could not have found out solely by reading the documentation.
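For illustration, the fallback proposed in the 2002 thread can be approximated on the caller's side. The sketch below is not what glibc or find does. It assumes a POSIX.1-2008 system that provides newlocale and uselocale (glibc does), and the function name fnmatch_bytewise_fallback is hypothetical. If fnmatch reports an error, the call is retried with the thread's locale temporarily switched to "C", which mirrors the observation in the question that LANG=C makes the file show up.

```c
#define _XOPEN_SOURCE 700   /* for newlocale(), uselocale(), freelocale() */
#include <fnmatch.h>
#include <locale.h>

/* Hypothetical helper: try fnmatch() in the current locale first; if it
   reports an error (neither 0 nor FNM_NOMATCH), retry in the "C" locale. */
int fnmatch_bytewise_fallback(const char *pattern, const char *name, int flags)
{
    int rc = fnmatch(pattern, name, flags);
    if (rc == 0 || rc == FNM_NOMATCH)
        return rc;                      /* definite answer, no fallback needed */

    /* Build a "C" locale object; on failure, hand back the original error. */
    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return rc;

    locale_t saved = uselocale(c_loc);  /* switch this thread only */
    rc = fnmatch(pattern, name, flags);
    uselocale(saved);                   /* restore the previous locale */
    freelocale(c_loc);
    return rc;
}
```

Retrying in the "C" locale changes how non-ASCII characters in the pattern are interpreted, so this is only reasonable for patterns such as * that do not depend on the character encoding.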