I guess you see this � invalid character because the name contains a byte sequence that isn't valid UTF-8. File names on typical unix filesystems (including yours) are byte strings, and it's up to applications to decide on what encoding to use. Nowadays, there is a trend to use UTF-8, but it's not universal, especially in locales that could never live with plain ASCII and have been using other encodings since before UTF-8 even existed.
Try
LC_CTYPE=en_US.iso88591 ls
to see if the file name makes sense in ISO-8859-1 (latin-1). If it doesn't, try other locales. Note that only the LC_CTYPE locale setting matters here.
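To make the point about byte strings concrete, here is a minimal Python sketch. The byte string is a made-up example name (not one taken from your system), and decode_or_none is a hypothetical helper introduced only for illustration:

```python
from typing import Optional

# Hypothetical file name as raw bytes: "café" encoded in ISO-8859-1.
raw_name = b"caf\xe9"

def decode_or_none(raw: bytes, encoding: str) -> Optional[str]:
    """Return the decoded name, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

# 0xE9 announces a three-byte UTF-8 sequence that never arrives, so this is None.
as_utf8 = decode_or_none(raw_name, "utf-8")

# The same bytes are a perfectly good ISO-8859-1 string.
as_latin1 = decode_or_none(raw_name, "iso-8859-1")

# A lenient decoder substitutes U+FFFD, the replacement character a terminal shows.
as_displayed = raw_name.decode("utf-8", errors="replace")
```

Keep in mind that ISO-8859-1 assigns a character to every byte value, so decoding never fails there. Whether the result "makes sense" is still something you have to judge by eye.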
In a UTF-8 locale, the following command will show you all files whose name is not valid UTF-8:
grep-invalid-utf8 () {
perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print'
}
find | grep-invalid-utf8
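If you would rather not depend on the regex, roughly the same search can be sketched in Python. This is only an approximation of the shell function above: is_valid_utf8 and find_invalid_utf8 are names I made up, and Python's strict decoder rejects a few things the regex lets through (overlong forms and the obsolete 5- and 6-byte sequences).

```python
import os
from typing import Iterator

def is_valid_utf8(raw: bytes) -> bool:
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def find_invalid_utf8(root: bytes = b".") -> Iterator[bytes]:
    """Yield paths (as bytes) under root whose last component is not valid UTF-8."""
    # Passing bytes to os.walk makes it return bytes, so no decoding happens behind our back.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)
```

Calling list(find_invalid_utf8(b".")) from the directory you are interested in gives you the offending paths as byte strings, which you can then try decoding with other encodings.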
You can check if they make more sense in another locale with recode or iconv:
find | grep-invalid-utf8 | recode latin1..utf8
find | grep-invalid-utf8 | iconv -f latin1 -t utf8
Once you've determined that a bunch of file names are in a certain encoding (e.g. latin1), one way to rename them is
find | grep-invalid-utf8 |
rename 'BEGIN {binmode STDIN, ":encoding(latin1)"; use Encode;}
$_=encode("utf8", $_)'
This uses the Perl rename command available on Debian and Ubuntu. You can pass it -n to show what it would be doing without actually renaming the files.
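On systems without the Perl rename command, the same idea can be sketched in Python. This is an assumption-laden illustration rather than a drop-in replacement: reencode_name is a hypothetical helper, it only re-encodes the last path component, and it defaults to a dry run in the spirit of -n.

```python
import os

def reencode_name(path: bytes, source_encoding: str = "latin-1",
                  dry_run: bool = True) -> bytes:
    """Re-encode the last component of path to UTF-8 and return the new path.

    With dry_run=True (the default) nothing is renamed; the new path is only computed.
    """
    directory, name = os.path.split(path)
    new_name = name.decode(source_encoding).encode("utf-8")
    new_path = os.path.join(directory, new_name)
    if new_path != path and not dry_run:
        # Refuse to overwrite an existing entry that already has the target name.
        if os.path.lexists(new_path):
            raise FileExistsError(f"{new_path!r} already exists")
        os.rename(path, new_path)
    return new_path
```

If directories are affected too, process the deepest paths first, otherwise renaming a directory invalidates the paths of everything below it.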
I have reformulated your questions a bit, for reasons that should
appear evident when you read them in sequence.
1. Is it possible to configure a Linux filesystem to use a fixed character encoding for file names, regardless of the LANG/LC_ALL environment?
No, this is not possible: as you mention in your question, a UNIX file
name is just a sequence of bytes; the kernel knows nothing about
the encoding, which is entirely a user-space (i.e., application-level)
concept.
In other words, the kernel knows nothing about LANG/LC_*, so it cannot
translate.
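You can see this split between kernel and user space from Python, which exposes both views. The sketch below is only an illustration: list_both_ways is a made-up helper and the file name is a placeholder.

```python
import os
import tempfile

def list_both_ways(directory: str):
    """Return (raw byte names, decoded str names) for one directory."""
    # A bytes argument yields the names exactly as the kernel stores them.
    raw_names = sorted(os.listdir(os.fsencode(directory)))
    # A str argument yields names decoded by Python, i.e. in user space.
    decoded_names = sorted(os.listdir(directory))
    return raw_names, decoded_names

with tempfile.TemporaryDirectory() as scratch:
    open(os.path.join(scratch, "example.txt"), "w").close()
    raw_names, decoded_names = list_both_ways(scratch)
```

The decoding step is driven by the process's locale settings, not by anything recorded in the filesystem, which is why two programs with different locales can show the same entry differently.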
2. Is it possible to let different file names refer to same file?
You can have multiple directory entries referring to the same file;
you can do that with hard links or symbolic links.
Be aware, however, that the file names that are not valid in the
current encoding (e.g., your GBK character string when you're working
in a UTF-8 locale) will display badly, if at all.
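As a small sketch of the two kinds of link, assuming a filesystem that supports both (the file names here are placeholders; in your case one name would be the UTF-8 form and the other the GBK form):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    original = os.path.join(scratch, "original.txt")
    hard = os.path.join(scratch, "hard-link.txt")
    soft = os.path.join(scratch, "soft-link.txt")
    with open(original, "w") as handle:
        handle.write("same content\n")

    os.link(original, hard)     # a second directory entry for the same inode
    os.symlink(original, soft)  # a separate inode that stores the target path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_count = os.stat(original).st_nlink       # counts hard links only
    soft_is_link = os.path.islink(soft)
    soft_resolves = os.path.samefile(original, soft)
```

A hard link keeps working if the original name is removed, while a symbolic link breaks as soon as its target path disappears or is renamed.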
3. Is it possible to patch the kernel to translate character encoding between file-system and current environment?
You cannot patch the kernel to do this (see 1.), but you could, in
theory, patch the C library (e.g., glibc) to perform this translation:
always convert file names to UTF-8 when it calls the kernel, and
convert them back to the current encoding when it reads a file name
from the kernel.
A simpler approach could be to write an overlay filesystem with FUSE
that just redirects any filesystem request to another location after
converting the file name to/from UTF-8. Ideally you could mount this
filesystem in ~/trans, and when an access is made to
~/trans/a/GBK/encoded/path then the FUSE filesystem really accesses
/a/UTF-8/encoded/path.
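The name translation at the heart of such a filesystem is small. Here is a sketch in Python of just that part, with no FUSE binding involved; to_backing_path and to_client_name are hypothetical helpers, and GBK is assumed as the client encoding:

```python
def to_backing_path(requested: bytes, client_encoding: str = "gbk",
                    backing_root: bytes = b"/") -> bytes:
    """Map a path seen under the mount point to the UTF-8 path on the real filesystem."""
    # Decoding the whole path at once is safe for GBK: the byte 0x2F ("/")
    # never occurs as part of a multi-byte GBK character.
    text = requested.decode(client_encoding)
    relative = text.encode("utf-8").lstrip(b"/")
    return backing_root.rstrip(b"/") + b"/" + relative

def to_client_name(stored: bytes, client_encoding: str = "gbk") -> bytes:
    """Map a UTF-8 name read from the real filesystem back to the client encoding."""
    # Raises UnicodeDecodeError if the stored name is not UTF-8, and
    # UnicodeEncodeError if it has no representation in the client encoding.
    return stored.decode("utf-8").encode(client_encoding)
```

Every path-taking operation would go through the first function, and every directory listing through the second. Both can raise, and deciding what to do in that case is the hard part.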
However, the problem with these approaches is: what do you do with
files that already exist on your filesystem and are not UTF-8 encoded?
You cannot simply pass them through untranslated, because then you don't
know how to convert them; and you cannot mangle them by translating
invalid character sequences to ? because that could create
conflicts...
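The conflict is easy to reproduce. In this Python sketch (the two byte strings are invented examples, and U+FFFD stands in for the ? above), two different names collapse into one after mangling. For comparison, it also shows the reversible "surrogateescape" error handler that Python itself uses for file names, as one possible way around the problem rather than a recommendation for your case:

```python
def mangle(raw: bytes) -> str:
    """Lossy: every invalid sequence becomes the replacement character U+FFFD."""
    return raw.decode("utf-8", errors="replace")

def escape(raw: bytes) -> str:
    """Lossless: each invalid byte becomes a distinct lone surrogate code point."""
    return raw.decode("utf-8", errors="surrogateescape")

def unescape(text: str) -> bytes:
    """Recover the original bytes from the output of escape()."""
    return text.encode("utf-8", errors="surrogateescape")

first = b"a\xe9"   # hypothetical latin-1 name
second = b"a\xe8"  # a different hypothetical latin-1 name

collides = mangle(first) == mangle(second)        # both become "a" + U+FFFD
stays_distinct = escape(first) != escape(second)  # the escaped forms differ
round_trips = unescape(escape(first)) == first    # and map back to the same bytes
```

The escaped strings are not valid Unicode text and cannot be written out as strict UTF-8, so they only help inside a program that knows to reverse the escaping.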
Best Answer
While glibc defines
#define FILENAME_MAX 4096
on Linux, which limits path length to 4096 bytes, there's a hard 255-byte limit in the Linux VFS that all filesystems must conform to. This limit is defined as NAME_MAX in /usr/include/linux/limits.h. The lookup code in linux/fs/libfs.c returns an error (ENAMETOOLONG) if you dare use a file name longer than 255 bytes.
So not only would you have to redefine this limit, you would also have to rewrite the filesystems' source code (and on-disk structures) to be able to use it. And then, outside of your own machine, you won't be able to mount such a filesystem unless it uses its own extensions to store very long file names (like FAT32 does).
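You can probe the limit from user space without reading kernel source. The following is a rough Python sketch; name_max and can_create are helpers I made up, and the value reported depends on the filesystem holding the scratch directory (typically 255 on Linux):

```python
import errno
import os
import tempfile

def name_max(directory: str) -> int:
    """Ask the system for the longest file name allowed in this directory, in bytes."""
    return os.pathconf(directory, "PC_NAME_MAX")

def can_create(directory: str, length: int) -> bool:
    """Try to create (then remove) a file whose name is `length` ASCII bytes long."""
    path = os.path.join(directory, "x" * length)
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except OSError as exc:
        if exc.errno == errno.ENAMETOOLONG:
            return False
        raise
    os.close(fd)
    os.unlink(path)
    return True

with tempfile.TemporaryDirectory() as scratch:
    limit = name_max(scratch)
    fits = can_create(scratch, limit)          # a name exactly at the limit
    too_long = can_create(scratch, limit + 1)  # one byte over the limit
```

Since the limit is counted in bytes, a name made of multi-byte UTF-8 characters hits it after fewer characters than an ASCII name does.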
TL;DR: there is a way, but unless you're a kernel hacker or know C very well, it isn't a practical one.