Short answer: the restrictions are imposed by the Unix/Linux/BSD kernel, in the namei() function. Encoding takes place in user-level programs like xterm, firefox or ls.
I think you're starting from incorrect premises. A file name in Unix is a string of bytes with arbitrary values. Two values, 0x0 (ASCII NUL) and 0x2f (ASCII '/'), are simply not allowed, not as part of a multi-byte character encoding, not as anything. A "byte" can contain a number representing a character (in ASCII and some other encodings), but a "character" can require more than one byte (for example, code points above 0x7f in the UTF-8 representation of Unicode).
These restrictions arise from file-name printing conventions and the ASCII character set. The original Unixes used ASCII '/' (numerically 0x2f) bytes to separate the pieces of a partially- or fully-qualified path (e.g. '/usr/bin/cat' has the pieces "usr", "bin" and "cat"). The original Unixes used ASCII NUL to terminate strings. Other than those two values, bytes in file names may assume any value. You can see an echo of this in the UTF-8 encoding of Unicode: printable ASCII characters, including '/', take only one byte in UTF-8, and the UTF-8 encoding of any code point above 0x7f contains no zero-valued bytes; the only zero byte UTF-8 ever produces is the encoding of the NUL control character itself. UTF-8 was invented for Plan 9, The Pretender to the Throne of Unix.
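You can see this byte-level freedom from any shell. The sketch below (assuming a POSIX shell, GNU ls, and a writable temp directory) puts a tab and two non-ASCII bytes into a file name, which the kernel happily accepts, and then shows the one byte it refuses:

```shell
# Any byte values are accepted in a name, except 0x2f ('/') and 0x00 (NUL).
dir=$(mktemp -d)
name=$(printf 'weird\tname\303\251')     # a tab and two high bytes in the name
touch "$dir/$name"                       # the kernel accepts it
ls -b "$dir"                             # GNU ls escapes the odd bytes for display
touch "$dir/not/allowed" 2>/dev/null \
    || echo 'slash rejected'             # '/' is always a path separator
rm -rf "$dir"
```

Whether the two high bytes display as "é" or as gibberish depends entirely on the terminal's encoding, not on anything the kernel stored.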
Older Unixes (and it looks like Linux too) have a namei() function that looks at a path one byte at a time, breaking the path into pieces at 0x2f-valued bytes and stopping at a zero-valued byte. namei() is part of the Unix/Linux/BSD kernel, so that's where the exceptional byte values get enforced.
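The splitting namei() does can be sketched conceptually in shell (this is only an illustration of the idea; the kernel does it in C on raw bytes):

```shell
# A conceptual sketch of namei()'s job: split a path into pieces at '/' bytes.
path='/usr/bin/cat'
old_ifs=$IFS
IFS='/'                      # cut fields at 0x2f bytes only
set -- $path                 # word-splitting breaks the path at each '/'
IFS=$old_ifs
for piece in "$@"; do
    # skip the empty piece produced by the leading '/'
    [ -n "$piece" ] && printf 'piece: %s\n' "$piece"
done
# prints: piece: usr
#         piece: bin
#         piece: cat
```

Note that nothing in this walk cares whether a piece is valid in any character encoding; a piece is just the bytes between two slashes.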
Notice that so far I've talked about byte values, not characters. namei() does not enforce any character semantics on the bytes. That's up to the user-level programs, like ls, which might sort file names based on byte values or on character values. xterm decides which pixels to light up for file names based on the character encoding. If you don't tell xterm you've got UTF-8 encoded file names, you'll see a lot of gibberish when it displays them. If vim isn't compiled to detect UTF-8 (or UTF-16, or UTF-32) encodings, you'll see a lot of gibberish when you open a "text file" containing UTF-8 encoded characters.
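That "gibberish" is easy to reproduce. If the bytes of a UTF-8 string are interpreted as ISO-8859-1 and redisplayed, each multi-byte sequence turns into two spurious characters. A quick sketch, assuming iconv is installed:

```shell
# The UTF-8 bytes of "café" (63 61 66 c3 a9) misread as ISO-8859-1:
# the sequence c3 a9 becomes the two characters 'Ã' and '©'.
printf 'caf\303\251' | iconv -f ISO-8859-1 -t UTF-8
# prints: cafÃ©
```

This is exactly what a terminal in the wrong encoding does with your file names: same bytes, wrong interpretation.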
The oldest character encoding used in consoles like the VT52 was ASCII.
That basic decision has been carried over for many years: most consoles use ASCII as the most basic character set, as defined by ANSI. The next set of encodings (in the West) are the ISO-8859 sets (from 1 to 15), one for each language group, the most common being ISO-8859-1 (Western European, Latin-1), with the others used in proportion to the corresponding languages.
Then there is the most general list of world characters, Unicode, which in Linux is usually encoded as UTF-8.
That encoding is the most common one for present-day terminals and programs in Linux.
From more general to particular settings:
OS
The default in Debian since Etch, released on Apr 8th, 2007 (13 years ago), has been UTF-8:
Note: Fresh Debian/Etch installations have UTF-8 enabled by default.
And it is confirmed in the release notes:
The default encoding for new Debian GNU/Linux installations is UTF-8. A number of applications will also be set up to use UTF-8 by default.
What that means is that Debian (and Ubuntu, Mint, and many other distributions) are UTF-8 capable by default.
locale
Which encoding (and country) is actually used is chosen by the user with the command dpkg-reconfigure locales and is left to user preference. That configures the actual settings reported by the locale command.
Each of the LC_* environment variables has a specific effect on one of the country/language categories defined by the POSIX spec.
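Per POSIX, LC_ALL overrides every individual LC_* category, and LANG is only the fallback for categories that are otherwise unset. A quick way to see the effect (the charmap shown is what glibc systems typically print; other C libraries may differ):

```shell
# LC_ALL overrides every category; LANG is only the fallback for unset ones.
LC_ALL=C locale charmap   # forces the minimal C locale; glibc prints ANSI_X3.4-1968
locale                    # lists every LC_* category and its current value
```

Because LC_ALL wins over everything, it is also the standard way to run a single command in a known locale without changing your session settings.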
tty
But the above are just "general" settings; a particular terminal may (or may not) match them. In general, though, the usual encoding for most terminals today is UTF-8.
Whether a particular terminal (tty) is set to UTF-8 may be checked with:
$ stty -a | grep -o '.iutf8'
iutf8
That is, there is no - (hyphen) before the printed result.
terminal
But the terminal (the GUI window) inside which the tty is (usually) running also has its own locale setting. If the settings are sane, then probably:
$ locale charmap
UTF-8
will print the correct answer.
But that is just a quick and very shallow look at all the i18n settings of Linux/Unix.
Takeaway: assuming that Linux is using UTF-8 is probably your best bet.
Best Answer
I have reformulated your questions a bit, for reasons that should appear evident when you read them in sequence.
1. Is it possible to configure a Linux filesystem to use a fixed character encoding for storing file names, regardless of the LANG/LC_ALL environment variables?
No, this is not possible: as you mention in your question, a UNIX file name is just a sequence of bytes; the kernel knows nothing about the encoding, which is entirely a user-space (i.e., application-level) concept.
In other words, the kernel knows nothing about LANG/LC_*, so it cannot translate.
2. Is it possible to let different file names refer to the same file?
You can have multiple directory entries referring to the same file; you can create them through hard links or symbolic links.
Be aware, however, that file names that are not valid in the current encoding (e.g., your GBK character string when you're working in a UTF-8 locale) will display badly, if at all.
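A minimal sketch of both kinds of link (file names here are made up): a hard link creates a second directory entry sharing the same inode, while a symbolic link is a separate small file whose content is a path.

```shell
# Hard link: a second name for the same inode.
# Symlink: a new file whose content is a path.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/original"
ln "$dir/original" "$dir/hardlink"      # same inode, two names
ln -s original "$dir/symlink"           # separate inode, points by path
cat "$dir/hardlink" "$dir/symlink"      # both print: hello
ls -li "$dir"                           # hardlink shares original's inode number
rm -rf "$dir"
```

Since the name in each directory entry is, again, just bytes, nothing stops one entry from being GBK-encoded and the other UTF-8-encoded, both pointing at the same data.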
3. Is it possible to patch the kernel to translate character encoding between file-system and current environment?
You cannot patch the kernel to do this (see 1.), but you could, in theory, patch the C library (e.g., glibc) to perform this translation: always convert file names to UTF-8 when it calls the kernel, and convert them back to the current encoding when it reads a file name from the kernel.
A simpler approach could be to write an overlay filesystem with FUSE that just redirects any filesystem request to another location after converting the file name to/from UTF-8. Ideally you could mount this filesystem on ~/trans, so that when an access is made to ~/trans/a/GBK/encoded/path, the FUSE filesystem really accesses /a/UTF-8/encoded/path.
However, the problem with these approaches is: what do you do with files that already exist on your filesystem and are not UTF-8 encoded? You cannot simply pass their names through untranslated, because then you don't know how to convert them; and you cannot mangle them by translating invalid character sequences to ? because that could create conflicts...
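For names whose encoding you do know, a one-off migration is often simpler than an overlay: convert the name itself with iconv and rename the file (the convmv utility automates exactly this). A sketch, where the bytes d6 d0 ce c4 are assumed to be the GBK encoding of two CJK characters, and assuming your iconv supports GBK:

```shell
# Rename one GBK-named file to its UTF-8 equivalent; the file data is untouched.
dir=$(mktemp -d)
gbk=$(printf '\326\320\316\304')                     # GBK-encoded name (assumed bytes)
touch "$dir/$gbk"
utf8=$(printf '%s' "$gbk" | iconv -f GBK -t UTF-8)   # translate the name itself
mv "$dir/$gbk" "$dir/$utf8"                          # just a rename of the entry
ls "$dir"                                            # now readable in a UTF-8 locale
rm -rf "$dir"
```

This sidesteps the conflict problem above only because you assert the source encoding up front; names that are not actually valid GBK would still make iconv fail, and you would have to handle those by hand.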