Bash – Identify Files with Non-ASCII or Non-Printable Characters

bash, character encoding, filenames, find, shell

In an 80 GB directory holding approximately 700,000 files, some file names contain non-English characters. Other than trawling through the file list laboriously, is there:

An easy way to list or otherwise identify these file names?
A way to generate printable non-English language characters – those characters that are not listed in the printable range of man ascii (so I can test that these files are being identified)?

Best Answer

Assuming that "non-English" means "not a printable ASCII character", you can use find with a pattern that matches every file whose name contains at least one character outside the printable ASCII range:

LC_ALL=C find . -name '*[! -~]*'

(The space is the first printable character listed on http://www.asciitable.com/, ~ is the last.)

Setting LC_ALL=C is required (strictly speaking, LC_CTYPE=C and LC_COLLATE=C are what matter); otherwise the character range is interpreted incorrectly. See also the manual page glob(7). Because LC_ALL=C makes find treat names as ASCII, it prints each byte of a multi-byte character (such as π) as a question mark when writing to a terminal. To see the real names, pipe the output through another program (e.g. cat) or redirect it to a file.
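
If the goal is to see exactly which bytes a flagged name contains, rather than the raw characters, one option is to pass each name through sed's `l` command, which POSIX defines as printing a line in an unambiguous form. This is a minimal sketch, not part of the original answer; the function name `show_odd_names` is made up for illustration.

```shell
# Hypothetical helper: print every flagged path in sed's unambiguous "l" form.
# Control characters appear as escapes such as \t, bytes outside ASCII as
# three-digit octal escapes, and each line ends with a literal "$".
show_odd_names() {
    # LC_ALL=C is inherited by the sh and sed child processes as well.
    LC_ALL=C find "${1:-.}" -name '*[! -~]*' -exec sh -c '
        for f do
            printf "%s\n" "$f" | sed -n l
        done' sh {} +
}

show_odd_names .
```

Note that sed may wrap very long lines with a trailing backslash, so long paths can span more than one output line.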

Instead of specifying a character range, you can also use [:print:] to select "printable characters". Be sure to set the C locale, or the results will seem quite arbitrary.

Example:

$ touch "$(printf '\u03c0')" "$(printf 'x\ty')"
$ ls -F
dir/  foo  foo.c  xrestop-0.4/  xrestop-0.4.tar.gz  π
$ find -name '*[! -~]*'       # this is broken (LC_COLLATE=en_US.UTF-8)
./x?y
./dir
./π
... (a lot more)
./foo.c
$ LC_ALL=C find . -name '*[! -~]*'
./x?y
./??
$ LC_ALL=C find . -name '*[! -~]*' | cat
./x y
./π
$ LC_ALL=C find . -name '*[![:print:]]*' | cat
./x y
./π
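
For the second part of the question (generating test names), printf octal escapes give exact bytes without relying on the terminal or on `\u` support. The following is a sketch under a few assumptions: the file names are invented for the test, the scratch directory comes from mktemp, and the filesystem accepts arbitrary bytes in names (typical on Linux; some filesystems reject names that are not valid UTF-8).

```shell
# Scratch directory so the experiment never touches real data.
scratch=$(mktemp -d) && cd "$scratch" || exit 1

# printf octal escapes produce exact bytes in any POSIX shell:
#   \317\200 = π in UTF-8, \351 = é in ISO-8859-1, \t = tab.
touch plain.txt \
      "$(printf 'pi-\317\200')" \
      "$(printf 'latin1-\351')" \
      "$(printf 'tab\there')"

# Count the matches instead of printing them, so no raw bytes reach the terminal.
matches=$(LC_ALL=C find . -type f -name '*[! -~]*' | wc -l | tr -d ' ')
echo "flagged $matches of 4 files"
```

Three of the four names contain a byte outside the space-to-tilde range, so the count should come out as 3; plain.txt is the only name that should not be flagged.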

Related Solutions

Shell – Bulk Rename Files with Special Characters

I guess you see this � invalid character because the name contains a byte sequence that isn't valid UTF-8. File names on typical unix filesystems (including yours) are byte strings, and it's up to applications to decide on what encoding to use. Nowadays, there is a trend to use UTF-8, but it's not universal, especially in locales that could never live with plain ASCII and have been using other encodings since before UTF-8 even existed.

Try LC_CTYPE=en_US.iso88591 ls to see if the file name makes sense in ISO-8859-1 (latin-1). If it doesn't, try other locales. Note that only the LC_CTYPE locale setting matters here (provided LC_ALL is unset, since LC_ALL overrides it).

In a UTF-8 locale, the following command will show you all files whose name is not valid UTF-8:

grep-invalid-utf8 () {
  perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print'
}
find | grep-invalid-utf8
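
As an alternative to the regular expression, iconv can act as the validity check, because converting from UTF-8 to UTF-8 fails when the input is not well-formed. This is a sketch of that swapped-in technique, not the method used above; the helper name `is_valid_utf8` and the two sample names are made up, and iconv implementations are typically stricter than the pattern above (for example about overlong or 5- and 6-byte sequences).

```shell
# Hypothetical helper: succeed only if the argument is well-formed UTF-8.
# The pipeline's exit status is iconv's, which is non-zero on invalid input.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Two sample names: π encoded as UTF-8, and é encoded as a single latin1 byte.
for name in "$(printf 'pi-\317\200')" "$(printf 'latin1-\351')"; do
    if is_valid_utf8 "$name"; then
        echo "valid UTF-8"
    else
        echo "not valid UTF-8"
    fi
done
```

The first sample should be reported as valid and the second as invalid.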

You can check if they make more sense in another locale with recode or iconv:

find | grep-invalid-utf8 | recode latin1..utf8
find | grep-invalid-utf8 | iconv -f latin1 -t utf8
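
When the original encoding is unknown, it can help to decode a single suspicious name under several candidates and compare the results by eye. The sketch below assumes the local iconv knows the encoding names passed to it (`iconv -l` lists them); the helper `try_encodings` and the sample name are hypothetical.

```shell
# Hypothetical helper: decode one raw name under each candidate encoding.
# Usage: try_encodings NAME ENCODING...
try_encodings() {
    name=$1
    shift
    for enc do
        # iconv exits non-zero when the bytes are not valid in $enc.
        if decoded=$(printf '%s' "$name" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%-12s %s\n' "$enc" "$decoded"
        else
            printf '%-12s (not valid in this encoding)\n' "$enc"
        fi
    done
}

# \351 is é in ISO-8859-1, but it is not a complete UTF-8 sequence.
try_encodings "$(printf 'caf\351')" ISO-8859-1 ISO-8859-2 UTF-8
```

Single-byte encodings such as the ISO-8859 family accept almost any byte sequence, so a successful conversion only shows that the name is decodable, not that the guess is right; the decoded text still has to look plausible.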

Once you've determined that a bunch of file names are in a certain encoding (e.g. latin1), one way to rename them is

find | grep-invalid-utf8 |
rename 'BEGIN {binmode STDIN, ":encoding(latin1)"; use Encode;}
        $_=encode("utf8", $_)'

This uses the perl rename command available on Debian and Ubuntu. You can pass it -n to show what it would be doing without actually renaming the files.

File Deletion – How to Delete a File with Non-Printing Characters in Filename

The file has a name, but it's made of non-printable characters. If you use ksh93, bash, zsh, mksh or FreeBSD sh, you can try to remove it by specifying its non-printable name. First ensure that the name is right with:

ls -ld $'\177'

If it shows the right file, then use rm:

rm $'\177'

Another (slightly riskier) approach is to use rm -i -- * . With the -i option, rm asks for confirmation before removing each file, so you can decline every file you want to keep and confirm only the one you want to delete.
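
In a shell without $'...' quoting, a command substitution around printf can build the same name, as long as the name does not end in a newline (command substitution strips trailing newlines). A minimal sketch, rehearsed in a scratch directory with made-up files so nothing real is deleted:

```shell
# Rehearse in a throwaway directory first.
scratch=$(mktemp -d) && cd "$scratch" || exit 1
touch keep.txt "$(printf '\177')"    # \177 is the DEL control character

# Build the awkward name once and reuse it.
target=$(printf '\177')

ls -ld -- "$target"    # confirm that this is the intended file
rm -- "$target"        # "--" guards against names that start with a dash

remaining=$(ls -A)
echo "left behind: $remaining"
```

Only keep.txt should remain afterwards.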

Good luck!
