Bash – Identify Files with Non-ASCII or Non-Printable Characters

bash, character encoding, filenames, find, shell

In an 80 GB directory containing approximately 700,000 files, some file names contain non-English characters. Other than laboriously trawling through the file list, is there:

  • An easy way to list or otherwise identify these file names?
  • A way to generate printable non-English language characters – those characters that are not listed in the printable range of man ascii (so I can test that these files are being identified)?

Best Answer

Assuming that "non-English" means "not an ASCII character", you can use find with a pattern to list every file whose name contains at least one character that is not printable ASCII:

LC_ALL=C find . -name '*[! -~]*'

(Space is the first printable character listed on http://www.asciitable.com/ and ~ is the last, so [! -~] matches any single character outside that range.)

Setting LC_ALL=C is required (strictly, it is LC_CTYPE=C and LC_COLLATE=C that matter); otherwise the character range is interpreted according to the current locale and matches the wrong names. See also the manual page glob(7). Because LC_ALL=C makes find treat file names as ASCII, it prints each byte of a multi-byte character (such as π) as a question mark when writing to a terminal. To see the real names, pipe the output through another program (e.g. cat) or redirect it to a file.
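
With some 700,000 files it may be handier to capture the matches in a file than to read them off the terminal; redirecting also sidesteps the question-mark substitution. A minimal sketch, assuming a POSIX shell; list_suspect_names is a hypothetical helper name, not part of find:

```shell
# Hypothetical helper: write every name under DIR that contains a
# character outside printable ASCII to OUTFILE.
# Usage: list_suspect_names DIR OUTFILE
list_suspect_names() {
    # The LC_ALL=C prefix applies to this one find invocation only;
    # the caller's locale is left untouched.
    LC_ALL=C find "$1" -name '*[! -~]*' > "$2"
}
```

It could be called as: list_suspect_names . /tmp/suspect_names.txt (the output path is only an example). Running wc -l on the resulting file then gives a rough count; rough, because a file name that itself contains a newline spans more than one line.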

Instead of an explicit range, the character class [:print:] can be used inside the bracket expression to select "printable characters", as in '*[![:print:]]*'. Be sure to set the C locale here as well, or the results can look quite arbitrary.

Example:

$ touch $(printf '\u03c0') "$(printf 'x\ty')"
$ ls -F
dir/  foo  foo.c  xrestop-0.4/  xrestop-0.4.tar.gz  π
$ find -name '*[! -~]*'       # this is broken (LC_COLLATE=en_US.UTF-8)
./x?y
./dir
./π
... (a lot more)
./foo.c
$ LC_ALL=C find . -name '*[! -~]*'
./x?y
./??
$ LC_ALL=C find . -name '*[! -~]*' | cat
./x y
./π
$ LC_ALL=C find . -name '*[![:print:]]*' | cat
./x y
./π
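
To test that such files are really being identified, a scratch directory with a few known names can be created and searched. The following is a minimal sketch, assuming mktemp is available; all file names are made up for illustration, and the non-ASCII name is written as raw UTF-8 bytes so that it does not depend on the shell's \u support:

```shell
tmpdir=$(mktemp -d)   # scratch directory, removed at the end

# Plain ASCII names (including one with a space) that should NOT be reported.
touch "$tmpdir/plain.txt" "$tmpdir/with space.txt"

# Names that should be reported: pi written as its UTF-8 bytes (0xCF 0x80),
# and a name containing a tab character.
touch "$tmpdir/$(printf '\317\200')" "$tmpdir/$(printf 'x\ty')"

# None of these names contains a newline, so counting lines counts matches.
matches=$(LC_ALL=C find "$tmpdir" -name '*[! -~]*' | wc -l)
echo "files flagged: $matches"

rm -rf "$tmpdir"
```

If the pattern behaves as described above, only the last two names are flagged and the two plain ASCII names are left out.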