I have reformulated your questions a bit, for reasons that should
appear evident when you read them in sequence.
1. Is it possible to configure a Linux filesystem to use a fixed character encoding to store file names, regardless of the LANG/LC_ALL environment?
No, this is not possible: as you mention in your question, a UNIX file
name is just a sequence of bytes; the kernel knows nothing about
the encoding, which is entirely a user-space (i.e., application-level)
concept.
In other words, the kernel knows nothing about LANG/LC_*, so it cannot
translate.
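You can see this from user space with Python's bytes-path APIs (a minimal sketch; the GBK name is just an example, and it assumes a Linux filesystem that accepts arbitrary byte names):

```python
import os
import tempfile

# The kernel stores file names as raw bytes; it never consults LANG/LC_*.
# Passing a bytes path bypasses any user-space encoding entirely.
d = tempfile.mkdtemp()
gbk_name = "中文".encode("gbk")          # b'\xd6\xd0\xce\xc4' -- not valid UTF-8
with open(os.path.join(d.encode(), gbk_name), "w") as f:
    f.write("hello")

# Listing the directory with a bytes argument returns the exact same bytes:
print(os.listdir(d.encode()))           # [b'\xd6\xd0\xce\xc4']
```

The bytes come back exactly as they went in; no translation happened anywhere along the way.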
2. Is it possible to let different file names refer to the same file?
You can have multiple directory entries referring to the same file;
you can do this with hard links or symbolic links.
Be aware, however, that file names that are not valid in the
current encoding (e.g., your GBK character string when you're working
in a UTF-8 locale) will display badly, if at all.
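Both kinds of link can be created with `os.link` and `os.symlink` (the file names below are illustrative):

```python
import os
import tempfile

d = tempfile.mkdtemp()
orig = os.path.join(d, "original.txt")
with open(orig, "w") as f:
    f.write("data")

# A hard link: a second directory entry pointing at the same inode.
hard = os.path.join(d, "hardlink.txt")
os.link(orig, hard)

# A symbolic link: a separate file whose content is the target path.
sym = os.path.join(d, "symlink.txt")
os.symlink(orig, sym)

print(os.stat(orig).st_ino == os.stat(hard).st_ino)   # True: same inode
print(os.readlink(sym))                               # the path stored in the symlink
```

Nothing stops the second name from being a GBK-encoded byte string while the first is UTF-8; the kernel does not care.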
3. Is it possible to patch the kernel to translate character encodings between the filesystem and the current environment?
You cannot patch the kernel to do this (see 1.), but you could, in
theory, patch the C library (e.g., glibc) to perform this translation:
always convert file names to UTF-8 when it calls the kernel, and
convert them back to the current encoding when it reads a file name
from the kernel.
A simpler approach could be to write an overlay filesystem with FUSE
that just redirects any filesystem request to another location after
converting the file name to/from UTF-8. Ideally you could mount this
filesystem in ~/trans, so that when an access is made to
~/trans/a/GBK/encoded/path, the FUSE filesystem really accesses
/a/UTF-8/encoded/path.
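The name-translation step such an overlay would perform is just a per-component re-encode. A minimal sketch (the function name and the choice of GBK are assumptions; a real overlay, e.g. one built with the fusepy library, would call something like this on every path it receives):

```python
# Re-encode each path component from the "presentation" encoding (here GBK)
# to the on-disk encoding (UTF-8). Raises if a component is not valid GBK,
# which is exactly the "files that already exist" problem discussed below.
def translate_path(path_bytes: bytes,
                   from_enc: str = "gbk",
                   to_enc: str = "utf-8") -> bytes:
    parts = path_bytes.split(b"/")
    return b"/".join(p.decode(from_enc).encode(to_enc) for p in parts)

gbk_path = "/a/中文/path".encode("gbk")
print(translate_path(gbk_path))   # b'/a/\xe4\xb8\xad\xe6\x96\x87/path'
```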
However, the problem with these approaches is: what do you do with
files that already exist on your filesystem and are not UTF-8 encoded?
You cannot simply pass them through untranslated, because then you don't
know how to convert them; you cannot mangle them by translating
invalid character sequences to ?, because that could create
conflicts...
Short answer: these restrictions are imposed in the Unix/Linux/BSD
kernel, in the namei() function. Encoding takes place in user-level
programs like xterm, firefox or ls.
I think you're starting from incorrect premises. A file name in Unix is a string of bytes with arbitrary values. A few values, 0x0 (ASCII Nul) and 0x2f (ASCII '/'), are simply not allowed: not as part of a multi-byte character encoding, not as anything. A "byte" can contain a number representing a character (in ASCII and some other encodings), but a "character" can require more than one byte (for example, code points above 0x7f in the UTF-8 representation of Unicode).
These restrictions arise from file-name printing conventions and the ASCII character set. The original Unixes used ASCII '/' (numerically 0x2f) valued bytes to separate the pieces of a partially- or fully-qualified path (e.g., '/usr/bin/cat' has pieces "usr", "bin" and "cat"). The original Unixes used ASCII Nul to terminate strings. Other than those two values, bytes in file names may assume any other value. You can see an echo of this in the UTF-8 encoding of Unicode: printable ASCII characters, including '/', take only one byte in UTF-8, and the UTF-8 encoding of any code point above 0x7f never contains a zero-valued byte; only the Nul control character encodes to zero. UTF-8 was invented for Plan 9, The Pretender to the Throne of Unix.
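That property of UTF-8 is easy to verify directly (a small check, not part of the original answer):

```python
# UTF-8's design guarantees that the special bytes 0x00 (Nul) and 0x2f ('/')
# never appear inside the multi-byte encoding of any code point above 0x7f:
# lead and continuation bytes of such sequences are always >= 0x80.
s = "каталог/文件/αρχείο"
multibyte = [b for ch in s if ord(ch) > 0x7f
               for b in ch.encode("utf-8")]

print(all(b >= 0x80 for b in multibyte))   # True: no byte could be mistaken for '/' or Nul
print(s.encode("utf-8").count(0x2f))       # 2 -- only the two literal slashes
```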
Older Unixes (and, it looks like, Linux) have a namei() function that just looks at paths a byte at a time, breaking a path into pieces at 0x2f-valued bytes and stopping at a zero-valued byte. namei() is part of the Unix/Linux/BSD kernel, so that's where the exceptional byte values get enforced.
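A toy model of that byte-at-a-time splitting (this is an illustration of the idea, not the kernel's actual code):

```python
# Split a raw byte path the way namei() conceptually does: stop at the
# first 0x00 (Nul) byte, then break into components at 0x2f ('/') bytes.
# No character semantics are involved -- only the two magic byte values.
def split_path(raw: bytes) -> list:
    raw = raw.split(b"\x00", 1)[0]            # the kernel stops at Nul
    return [p for p in raw.split(b"\x2f") if p]

print(split_path(b"/usr/bin/cat"))            # [b'usr', b'bin', b'cat']
print(split_path(b"/usr/bin\x00ignored"))     # [b'usr', b'bin']
```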
Notice that so far, I've talked about byte values, not characters. namei() does not enforce any character semantics on the bytes. That's up to the user-level programs, like ls, which might sort file names based on byte values or character values. xterm decides what pixels to light up for file names based on the character encoding. If you don't tell xterm you've got UTF-8 encoded filenames, you'll see a lot of gibberish when you invoke it. If vim isn't compiled to detect UTF-8 (or whatever: UTF-16, UTF-32) encodings, you'll see a lot of gibberish when you open a "text file" containing UTF-8 encoded characters.
Best Answer
Everything has probably actually worked as you expected, but you have been misled by two different phenomena:
- ls and some other utilities display ? for non-printable characters by default
- ? is a glob that matches a single character

That results in the following behaviour: ls shows the file's name as ?, and because ? is a glob matching any single character, commands like ls ? match the file as well. So it seems like we have a file that is literally named ?, right? Actually, that's just a happy coincidence of the two above phenomena -- the file is still called ^G. Probably everything worked as you expected, despite what immediately appeared to indicate that the file was literally called ?.
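The two phenomena can be reproduced in a few lines (a sketch using Python's glob module, which applies the same ? pattern rule as the shell):

```python
import glob
import os
import tempfile

# Create a file whose name is literally a single BEL byte (0x07, i.e. ^G) --
# ls would display it as '?' because BEL is non-printable.
d = tempfile.mkdtemp()
open(os.path.join(d, "\a"), "w").close()

# The glob '?' matches ANY single character, so it matches this file too,
# which is what makes it look like the file is actually named '?'.
matches = glob.glob(os.path.join(d, "?"))
print([os.path.basename(m) for m in matches])   # ['\x07'] -- matched, but not named '?'
```

The glob match and the display substitution are independent: either alone would be harmless, but together they make ls ? appear to confirm a file named ?.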