Bash – Why does this find command not return only filenames containing non-ASCII characters

bash, character encoding, files, find, unicode

I'm trying to work out why this find command is not working. It should list only files whose names contain non-ASCII characters, yet it also matches the file called this_should_not_match below:

$ > find . -type f -name "*[^ -~]*"
./__º╚t
./this_should_not_match
./__╞_u
./__¡VW
./__▀√Z
./__εè_
./__∙Σ_
./__Σ_9
./__Σhm
./__φY_

My shell is Bash 3.2.

Best Answer

Ranges only work reliably and portably in the C locale. In other locales the behavior varies, but generally [x-y] matches the characters (strictly speaking collating elements, so it can even match sequences of characters) that sort after x and before y in some collation order. That order is often obscure and is not always the one sort would use.

In the C locale (see What does “LC_ALL=C” do?), characters are bytes and ranges are based on the code point of the characters (on byte values).

LC_ALL=C find . -type f -name "*[^ -~]*"

On an ASCII-based system, that command lists regular files whose names contain bytes outside the range 32 to 126 (space through ~). POSIX doesn't guarantee that the C locale uses the ASCII charset, but in practice you will be using ASCII unless you are on some EBCDIC-based IBM mainframe OS, in which case you would know about it.
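To see the effect in isolation, here is a minimal sketch that builds a scratch directory and runs the same pattern under the C locale. The directory and the two file names are hypothetical, chosen only for illustration, and it assumes an ASCII-based system whose filesystem accepts UTF-8 encoded names.

```shell
# Scratch directory with one plain-ASCII name and one non-ASCII name
# (both names are made up for this demo).
demo_dir=$(mktemp -d)
touch "$demo_dir/this_should_not_match"
touch "$demo_dir/$(printf 'caf\303\251')"   # "café" written as UTF-8 bytes

# With LC_ALL=C the bracket range is compared byte by byte, so
# [^ -~] means "any byte outside 32..126".
matches=$(cd "$demo_dir" && LC_ALL=C find . -type f -name '*[^ -~]*')
printf '%s\n' "$matches"

rm -rf "$demo_dir"
```

Under those assumptions only the second file should be listed, because every byte of this_should_not_match falls inside the space-to-tilde range.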

Also note that in a multi-byte character locale (such as the UTF-8 ones that are the norm nowadays), * may not even match every file name: on some systems it fails to match sequences of bytes that don't form valid characters.
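The following sketch shows why forcing the C locale also sidesteps that problem. It assumes a filesystem that accepts arbitrary bytes in names (typical on Linux; some filesystems reject such names), and the file name is hypothetical.

```shell
# Create a file whose name contains the lone byte 0xE9: "é" in Latin-1,
# but not a valid UTF-8 sequence.
demo_dir=$(mktemp -d)
touch "$demo_dir/$(printf 'bad_\351_name')"

# In the C locale every byte is a character, so * matches the whole name.
c_matches=$(cd "$demo_dir" && LC_ALL=C find . -type f -name '*' | wc -l)
echo "matches in C locale: $c_matches"

rm -rf "$demo_dir"
```

Running the same find without LC_ALL=C in a UTF-8 locale may or may not list the file, depending on the find implementation and C library.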
