I guess you see this � invalid character because the name contains a byte sequence that isn't valid UTF-8. File names on typical Unix filesystems (including yours) are byte strings, and it's up to applications to decide which encoding to use. Nowadays there is a trend toward UTF-8, but it's not universal, especially in locales that could never make do with plain ASCII and have been using other encodings since before UTF-8 existed.
Try LC_CTYPE=en_US.iso88591 ls
to see if the file name makes sense in ISO-8859-1 (latin-1). If it doesn't, try other locales. Note that only the LC_CTYPE locale setting matters here (and that LC_ALL, if set, overrides it).
In a UTF-8 locale, the following command will show you all files whose name is not valid UTF-8:
# Print only those input lines that are not structurally valid UTF-8.
grep-invalid-utf8 () {
perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print'
}
find | grep-invalid-utf8
You can check if they make more sense in another locale with recode or iconv:
find | grep-invalid-utf8 | recode latin1..utf8
find | grep-invalid-utf8 | iconv -f latin1 -t utf8
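When the right encoding is not obvious, trying several candidates on one name at a time can be quicker than switching locales. A minimal sketch, assuming iconv is installed and knows these encoding names; the helper name try_encodings and the list of candidates are hypothetical and only meant as a starting point.

```shell
# Hypothetical helper: print one byte string as decoded from several candidate encodings.
# The encoding list is an arbitrary starting point, not a recommendation.
try_encodings () {
  for enc in ISO-8859-1 ISO-8859-15 CP1252 KOI8-R; do
    printf '%s: ' "$enc"
    # iconv exits non-zero when the bytes are not valid in the source encoding.
    printf '%s\n' "$1" | iconv -f "$enc" -t UTF-8 2>/dev/null || echo '(cannot decode)'
  done
}

# Example: the bytes "caf" followed by 0xE9, which is é in Latin-1.
try_encodings "$(printf 'caf\351')"
```

The encoding whose line reads like a sensible name is the likely candidate; several single-byte encodings will always decode without error, so a human still has to judge the result.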
Once you've determined that a bunch of file names are in a certain encoding (e.g. latin1), one way to rename them is
find | grep-invalid-utf8 |
rename 'BEGIN {binmode STDIN, ":encoding(latin1)"; use Encode;}
$_=encode("utf8", $_)'
This uses the Perl rename command available on Debian and Ubuntu. You can pass it -n to show what it would do without actually renaming the files.
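Where the Perl rename command is not available, the same conversion can be approximated with iconv and mv. This is a minimal sketch under the assumption that the names really are Latin-1; the function name rename_latin1_to_utf8 is hypothetical, and names ending in a newline are not handled because command substitution strips trailing newlines.

```shell
# Hypothetical helper: rename one file whose name is Latin-1 to its UTF-8 equivalent.
rename_latin1_to_utf8 () {
  dir=$(dirname -- "$1")
  old=$(basename -- "$1")
  # Convert only the last path component; give up if iconv fails.
  new=$(printf '%s' "$old" | iconv -f ISO-8859-1 -t UTF-8) || return 1
  # Pure ASCII names convert to themselves, so there is nothing to do.
  [ "$old" = "$new" ] && return 0
  # Refuse to overwrite an existing file.
  if [ -e "$dir/$new" ]; then
    echo "skipping $1: target already exists" >&2
    return 1
  fi
  mv -- "$dir/$old" "$dir/$new"
}

# Possible usage (not run here); -depth lists directory contents before the directory:
#   find . -depth | grep-invalid-utf8 | while IFS= read -r f; do rename_latin1_to_utf8 "$f"; done
```

Running it on a copy of the directory first is a cheap way to check the result before touching the real files.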
The file has a name, but it's made of non-printable characters. If you use ksh93, bash, zsh, mksh or FreeBSD sh, you can try to remove it by specifying its non-printable name. First ensure that the name is right with: ls -ld $'\177'
If it shows the right file, then use rm: rm $'\177'
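In a shell without the $'...' quoting form (plain POSIX sh such as dash, for instance), printf can build the same byte. The sketch below is only an illustration: it works in a scratch directory created with mktemp, and the variable names are made up.

```shell
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d)

# Build the one-byte name 0x7F (DEL) with printf instead of $'\177'.
name=$(printf '\177')
: > "$tmp/$name"

# Show which bytes the name consists of, then check and remove the file.
printf '%s' "$name" | od -c
ls -ld -- "$tmp/$name"
rm -- "$tmp/$name"
```

This does not work for names ending in a newline, because command substitution strips trailing newlines.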
Another (slightly riskier) approach is to use rm -i -- *. With the -i option, rm asks for confirmation before removing each file, so you can decline for every file you want to keep and confirm only the one you want to remove.
Good luck!
Best Answer
Assuming that "foreign" means "not an ASCII character", you can use find with a pattern that matches names containing anything other than printable ASCII characters, that is, anything outside the range from space to ~. (The space is the first printable character listed on http://www.asciitable.com/, ~ is the last.)
Setting LC_ALL=C is required (actually, LC_CTYPE=C and LC_COLLATE=C); otherwise the character range is interpreted incorrectly. See also the manual page glob(7). Since LC_ALL=C causes find to interpret strings as ASCII, it will print multi-byte characters (such as π) as question marks. To fix this, pipe to some program (e.g. cat) or redirect to a file.
Instead of specifying character ranges, [:print:] can also be used to select "printable characters". Be sure to set the C locale, or you get (seemingly) quite arbitrary behavior. Example:
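A minimal sketch of the kind of command described above, assuming a find whose -name patterns support bracket expressions with ! negation (as described in glob(7)). It is a reconstruction from the description, not necessarily the exact command the answer used, and the function names are hypothetical.

```shell
# List entries under a directory (default: .) whose names contain any byte
# outside the printable ASCII range, i.e. outside space..~.
find_non_ascii_names () {
  LC_ALL=C find "${1:-.}" -name '*[! -~]*'
}

# Variant using the [:print:] character class instead of an explicit range.
find_non_print_names () {
  LC_ALL=C find "${1:-.}" -name '*[![:print:]]*'
}

# Pipe through cat, as suggested above, so the raw bytes are passed through
# instead of being shown as question marks on a terminal.
find_non_ascii_names . | cat
```

Both functions set the locale only for the find invocation itself, so the rest of the shell session keeps its usual locale.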