Shell – find(1): how is the star wildcard implemented such that it fails on some filenames?

Tags: character encoding, filenames, find, shell, wildcards

In a file system where filenames are in UTF-8, I have a file with a faulty name; it is displayed as: D�sinstaller, actual name according to zsh: D$'\351'sinstaller, Latin1 for Désinstaller, itself a French barbarism for "uninstall." Zsh would not match it with [[ $file =~ '^.*$' ]] but would match it with a globbing *—this is the behavior I expect.

Now I still expect to find it when running find . -name '*'—as a matter of fact, I would never expect a filename to fail this test. However, with LANG=en_US.utf8, the file does not show up, and I have to set LANG=C (or en_US, or '') for it to work.

Question: What is the implementation behind this, and how could I have predicted that outcome?

(Info: Arch Linux 3.14.37-1-lts, find (GNU findutils) 4.4.2)

Best Answer

That's a really nice catch. From a quick look at the source code for GNU find, I would say this boils down to how fnmatch behaves on invalid byte sequences (pred_name_common in pred.c):

b = fnmatch (str, base, flags) == 0;
(...)
return b;

This code tests the return value of fnmatch for equality with 0, but does not check for errors; this results in any errors being reported as "doesn't match".

It was suggested many years ago to change this libc function so that it always returns true for the * pattern, even on broken file names, but as far as I can tell the idea was rejected (see the thread starting at https://sourceware.org/ml/libc-hacker/2002-11/msg00071.html):

When fnmatch detects an invalid multibyte character it should fall back to single byte matching, so that "*" has a chance to match such a string.

And why is this better or more correct? Is there existing practice?

As mentioned by Stéphane Chazelas in a comment, and also in the same 2002 thread, this is inconsistent with the glob expansion performed by shells, which do not choke on invalid characters. Perhaps even more puzzling is the fact that reversing the test will match only those files that have broken names (create files in bash with touch $'D\351marrer' $'Touch\303\251' $'\346\227\245\346\234\254\350\252\236'):

$ find -name '*'
.
./Touché
./日本語

$ find -not -name '*'
./D?marrer

So, to answer your question, you could have predicted this by knowing the behavior of your fnmatch in this case, and knowing how find handles this function's return value; you probably could not have found out solely by reading the documentation.
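
For completeness, the workaround from the question can be wrapped in a small helper. This is only a sketch: find_all_names is a made-up name, and it relies on the C locale treating every byte as a valid single-byte character, so that fnmatch has nothing to reject.

```shell
# List everything under a directory (default: .), including names
# that are invalid in the user's usual locale.
find_all_names() {
    # LC_ALL takes precedence over LANG and the other LC_* variables,
    # and the assignment applies to this one find invocation only.
    LC_ALL=C find "${1:-.}" -name '*'
}
```

Setting LC_ALL for the single command leaves the locale of the calling shell untouched.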

Related Solutions

Shell – GNU find and masking the {} for some shells – which

Summary: If there ever was a shell that expanded {}, it's really old legacy stuff by now.

In the Bourne shell and in POSIX-compliant shells, braces ({ and }) are ordinary characters (unlike ( and ) which are word delimiters like ; and &, and [ and ] which are globbing characters). The following strings are all supposed to be printed literally:

$ echo { } {} {foo,bar} {1..3}
{ } {} {foo,bar} {1..3}

A word consisting of a single brace is a reserved word, which is only special if it is the first word of a command.

Ksh implements brace expansion as an incompatible extension to the Bourne shell. This can be turned off with set +B. Bash emulates ksh in this respect. Zsh implements brace expansion as well; there it can be turned off with set -I or setopt ignore_braces or emulate sh. None of these shells expand {} in any case, even when it's a substring of a word (e.g. foo{}bar), due to the common use in arguments to find and xargs.
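
A quick way to check this in whatever shell is at hand is to print the arguments a command actually receives. show_args is a hypothetical helper, not a standard utility. Quoting '{}' remains a harmless precaution for any shell that might treat braces specially.

```shell
# Print each argument between angle brackets, one per line, so the
# result of the shell's word expansion is visible.
show_args() {
    printf '<%s>\n' "$@"
}

# Unquoted, embedded and quoted forms of the {} token.
# In bash and dash all three arrive unchanged.
show_args {} foo{}bar '{}'
```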

The Single UNIX Specification, Version 2 notes that:

In some historical systems, the curly braces are treated as control operators. To assist in future standardisation activities, portable applications should avoid using unquoted braces to represent the characters themselves. It is possible that a future version of the ISO/IEC 9945-2:1993 standard may require that { and } be treated individually as control operators, although the token {} will probably be a special-case exemption from this because of the often-used find {} construct.

This note was dropped in subsequent versions of the standard; the examples for find have unquoted uses of {}, as do the examples for xargs. There may have been historical Bourne shells where {} had to be quoted, but they would be really old legacy systems by now.

The csh implementations I have at hand (OpenBSD 4.7, BSD csh on Debian, tcsh) all expand {foo} to foo but leave {} alone.

Opendir and readdir encoding strings behind the back

opendir and readdir themselves work on bytes. They do not perform any reencoding.

Some filesystem drivers may impose constraints on the byte sequences. For example, HFS+ normalizes file names using a proprietary Unicode normalization scheme. I would expect the form returned by readdir to work when passed to opendir, however, so like the OP in the Ubuntu forum thread that jw013 mentioned, I suspect a bug in the HFS+ driver. It is not the only program that is tripped up by Hangul on HFS+. Even OS X seems to have trouble with Unicode normalization.
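
To see whether two names that look identical are really the same byte sequence, it helps to dump the bytes. The sketch below assumes a POSIX od, and name_bytes is a hypothetical helper. It compares a precomposed é (U+00E9) with the decomposed form e + U+0301, which is the kind of difference a normalizing filesystem can introduce.

```shell
# Print the bytes of a string as one run of hex digits.
name_bytes() {
    # -An drops the offset column; -v stops od from collapsing repeated lines.
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
    echo
}

name_bytes "$(printf 'Touch\303\251')"    # precomposed: ends in c3 a9
name_bytes "$(printf 'Touche\314\201')"   # decomposed: ends in 65 cc 81
```

Both strings are normally displayed as Touché, yet they differ at the byte level, which is the level opendir and readdir operate on.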
