Linux Filesystems – Questions About Character Encoding

Tags: character encoding, filenames, filesystems, linux

Because I exchange a lot of files between Windows (GBK encoding) and Linux (UTF-8 encoding), I run into character encoding issues easily, such as:

Zipping or tarring files whose names contain Chinese characters on Windows, then unzipping or untarring them on Linux.
Running a migrated legacy Java web application (designed on Windows, using GBK encoding in JSP) that writes files with GBK-encoded names to disk.
Transferring files with GBK-encoded names via FTP get/put between a Windows FTP server and a Linux client.
Switching the LANG environment variable on Linux.

What these cases have in common is trouble locating and naming files. After some searching, I found the article Using Unicode in Linux (https://www.linux.com/news/using-unicode-linux/), which says:

the operating system and many utilities do not realize what characters the bytes in file names represent.

So it is possible to have two files with the same name (the SAME when each name is decoded with the correct character set, but DIFFERENT in bytes), such as 中文.txt in two different encodings:

[root@fedora test]# ls
????  中文
[root@fedora test]# ls | iconv -f GBK
中文
涓iconv: illegal input sequence at position 7
[root@fedora test]# ls 中文 && ls $'\xd6\xd0\xce\xc4' | iconv -f gbk
中文
中文
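
The byte-level difference behind this listing can be reproduced without touching the filesystem. The following is a minimal Python sketch using only the standard codecs; the variable names are illustrative and not part of the original session.

```python
# The same two characters, encoded two ways.
name = "中文"

utf8_bytes = name.encode("utf-8")   # e4 b8 ad e6 96 87
gbk_bytes = name.encode("gbk")      # d6 d0 ce c4

print(utf8_bytes.hex(" "))
print(gbk_bytes.hex(" "))

# Decoding each byte string with its own encoding gives back the same text,
# even though the byte strings themselves differ.
assert utf8_bytes != gbk_bytes
assert utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk") == name

# The GBK bytes are not valid UTF-8, so a UTF-8 environment cannot display them.
try:
    gbk_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

The two byte strings match the `$'\xd6\xd0\xce\xc4'` and `$'\xe4\xb8\xad\xe6\x96\x87'` forms used in this question.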

Questions:

Is it possible to configure a Linux filesystem to use a fixed character encoding (as NTFS uses UTF-16 internally) to store file names, regardless of the LANG/LC_ALL environment?
Or, what I actually want to ask: is it possible to make the file name 中文.txt ($'\xe4\xb8\xad\xe6\x96\x87.txt') in a zh_CN.UTF-8 environment and the file name 中文.txt ($'\xd6\xd0\xce\xc4.txt') in a zh_CN.GBK environment refer to the same file?
If it is not configurable, is it possible to patch the kernel to translate the character encoding between the filesystem and the current environment (just a question, not a request for an implementation)? And how large would the performance impact be, if it is possible?

Best Answer

I have reformulated your questions a bit, for reasons that should become evident when you read them in sequence.

1. Is it possible to configure a Linux filesystem to use a fixed character encoding to store file names, regardless of the LANG/LC_ALL environment?

No, this is not possible: as you mention in your question, a UNIX file name is just a sequence of bytes; the kernel knows nothing about the encoding, which is entirely a user-space (i.e., application-level) concept.

In other words, the kernel knows nothing about LANG/LC_*, so it cannot translate.
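
A small Python sketch can illustrate this. It assumes a Linux filesystem that accepts arbitrary byte names (ext4 and tmpfs do, for example); the helper name `create_both_names` is made up for this illustration. Passing `bytes` paths keeps Python from applying any locale-dependent encoding, so the names reach the kernel unmodified.

```python
import os
import tempfile

def create_both_names(directory: bytes) -> list:
    """Create 中文.txt twice, once per encoding, and return the raw listing."""
    for encoding in ("utf-8", "gbk"):
        raw_name = "中文.txt".encode(encoding)
        # A bytes path is handed to the kernel as-is, with no re-encoding.
        with open(os.path.join(directory, raw_name), "wb") as f:
            f.write(encoding.encode("ascii"))
    # A bytes argument makes listdir return raw bytes names as well.
    return sorted(os.listdir(directory))

with tempfile.TemporaryDirectory() as tmp:
    names = create_both_names(os.fsencode(tmp))
    print(names)
```

Both entries coexist in the same directory, because to the kernel they are simply two different byte strings.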

2. Is it possible to let different file names refer to the same file?

You can have multiple directory entries referring to the same file; you can create them with hard links or symbolic links.

Be aware, however, that file names that are not valid in the current encoding (e.g., your GBK character string when you're working in a UTF-8 locale) will display badly, if at all.
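
As a rough sketch of the hard-link approach, the Python snippet below creates a file under its UTF-8 name and then links it under the GBK-encoded name. The helper name `link_gbk_alias` is hypothetical, and the code assumes a filesystem that supports hard links and arbitrary byte names.

```python
import os
import tempfile

def link_gbk_alias(directory: bytes, text_name: str) -> tuple:
    """Create a file under its UTF-8 name and hard-link it under its GBK name."""
    utf8_path = os.path.join(directory, text_name.encode("utf-8"))
    gbk_path = os.path.join(directory, text_name.encode("gbk"))
    with open(utf8_path, "wb") as f:
        f.write(b"hello")
    # Second directory entry for the same inode.
    os.link(utf8_path, gbk_path)
    return os.stat(utf8_path), os.stat(gbk_path)

with tempfile.TemporaryDirectory() as tmp:
    st_utf8, st_gbk = link_gbk_alias(os.fsencode(tmp), "中文.txt")
    same_file = (st_utf8.st_dev, st_utf8.st_ino) == (st_gbk.st_dev, st_gbk.st_ino)
    print("same file:", same_file, "link count:", st_utf8.st_nlink)
```

Note that such an alias has to be created for every file individually; nothing creates the second name automatically for new files.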

3. Is it possible to patch the kernel to translate character encoding between file-system and current environment?

You cannot patch the kernel to do this (see 1.), but you could, in theory, patch the C library (e.g., glibc) to perform this translation: always convert file names to UTF-8 when it calls the kernel, and convert them back to the current encoding when it reads a file name from the kernel.

A simpler approach could be to write an overlay filesystem with FUSE that just redirects every filesystem request to another location after converting the file name to/from UTF-8. Ideally you could mount this filesystem on ~/trans, so that an access to ~/trans/a/GBK/encoded/path makes the FUSE filesystem actually access /a/UTF-8/encoded/path.
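
The name mapping such an overlay would need can be sketched in a few lines of Python. This is not a FUSE filesystem, only the path translation step; the function names `to_backend_path` and `to_client_path` are hypothetical, and GBK is assumed as the client encoding.

```python
def to_backend_path(client_path: bytes, client_encoding: str = "gbk") -> bytes:
    """Re-encode each path component from the client encoding to UTF-8."""
    components = client_path.split(b"/")
    # Raises UnicodeDecodeError if a component is not valid in client_encoding.
    translated = [c.decode(client_encoding).encode("utf-8") for c in components]
    return b"/".join(translated)

def to_client_path(backend_path: bytes, client_encoding: str = "gbk") -> bytes:
    """Inverse mapping: UTF-8 names on disk back to the client encoding."""
    components = backend_path.split(b"/")
    # Raises if a name on disk is not valid UTF-8 or not representable.
    translated = [c.decode("utf-8").encode(client_encoding) for c in components]
    return b"/".join(translated)

gbk_request = b"/a/\xd6\xd0\xce\xc4.txt"
on_disk = to_backend_path(gbk_request)
print(on_disk)
print(to_client_path(on_disk) == gbk_request)
```

Both functions raise on names they cannot convert, which is exactly the case a real implementation would have to decide how to handle.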

However, the problem with these approaches is: what do you do with files that already exist on your filesystem and are not UTF-8 encoded? You cannot simply pass them through untranslated, because then you don't know how to convert them; you cannot mangle them by translating invalid character sequences to ?, because that could create conflicts...
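
The conflict risk is easy to demonstrate. In the Python sketch below, `mangle` is a hypothetical helper that replaces every byte that is not valid UTF-8 with ?; two different on-disk names then collapse into the same displayed name.

```python
def mangle(raw_name: bytes) -> str:
    """Decode as UTF-8, turning every undecodable byte into '?'."""
    return raw_name.decode("utf-8", errors="replace").replace("\ufffd", "?")

# Two distinct byte strings, neither of which is valid UTF-8.
first = b"\xd6\xd0\xce\xc4.txt"    # 中文.txt in GBK
second = b"\xd7\xd1\xcf\xc5.txt"   # a different name, also not valid UTF-8

print(mangle(first), mangle(second))
print("collision:", first != second and mangle(first) == mangle(second))
```

After mangling, a request for the mangled name can no longer be mapped back to a single file.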

Related Solutions

Unix Filenames – Understanding File Name Encoding

Short answer: the restrictions are imposed in the Unix/Linux/BSD kernel, in the namei() function. Encoding is handled by user-level programs like xterm, firefox or ls.

I think you're starting from incorrect premises. A file name in Unix is a string of bytes with arbitrary values. Two values, 0x0 (ASCII Nul) and 0x2f (ASCII '/'), are just not allowed: not as part of a multi-byte character encoding, not as anything. A "byte" can contain a number representing a character (in ASCII and some other encodings), but a "character" can require more than 1 byte (for example, code points above 0x7f in the UTF-8 representation of Unicode).

These restrictions arise from file name printing conventions and the ASCII character set. The original Unixes used ASCII '/' (numerically 0x2f) valued bytes to separate pieces of a partially- or fully-qualified path (like '/usr/bin/cat' has pieces "usr", "bin" and "cat"). The original Unixes used ASCII Nul to terminate strings. Other than those two values, bytes in file names may assume any other value. You can see an echo of this in the UTF-8 encoding for Unicode. Printable ASCII characters, including '/', take only one byte in UTF-8. The UTF-8 encoding of a code point above 0x7f never includes a zero-valued byte or a 0x2f byte; only the Nul control character itself encodes to a zero byte. UTF-8 was invented for Plan 9, The Pretender to the Throne of Unix.
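
This property of UTF-8 can be spot-checked with a short Python sketch. The helper name and the sample string are arbitrary choices for illustration; the check is a spot test on a few characters, not a proof.

```python
def multibyte_units_are_high(text: str) -> bool:
    """True if every non-ASCII character encodes only to bytes >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True

sample = "中文/résumé\x00Ω"
print(multibyte_units_are_high(sample))

# '/' and Nul stay single bytes, so they can only appear as themselves.
print("/".encode("utf-8"), "\x00".encode("utf-8"))
print(b"/" in "中文".encode("utf-8"), b"\x00" in "中文".encode("utf-8"))
```

Because multi-byte sequences use only bytes with the high bit set, a byte-oriented path parser can split UTF-8 names at 0x2f safely.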

Older Unixes (and, it looks like, Linux) have a namei() function that just looks at paths a byte at a time, breaking them into pieces at 0x2f valued bytes and stopping at a zero-valued byte. namei() is part of the Unix/Linux/BSD kernel, so that's where the exceptional byte values get enforced.
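
The byte-at-a-time behaviour described here can be imitated in Python. This is only an illustration of the idea, not the kernel's actual code; `split_path` is a made-up name.

```python
def split_path(path: bytes) -> list:
    """Split a path into components by byte value alone."""
    # Stop at the first zero-valued byte, as a C string would.
    end = path.find(b"\x00")
    if end != -1:
        path = path[:end]
    # Break at every 0x2f byte; leading or repeated slashes yield empty
    # pieces, which are dropped.
    return [piece for piece in path.split(b"/") if piece]

print(split_path(b"/usr/bin/cat"))
# A GBK-encoded component passes through untouched: no decoding happens.
print(split_path(b"/tmp/\xd6\xd0\xce\xc4.txt\x00ignored"))
```

Nothing in the function knows or cares which encoding produced the bytes between the slashes.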

Notice that so far, I've talked about byte values, not characters. namei() does not enforce any character semantics on the bytes. That's up to the user-level programs, like ls, which might sort file names based on byte values, or character values. xterm decides what pixels to light up for file names based on the character encoding. If you don't tell xterm you've got UTF-8 encoded filenames, you'll see a lot of gibberish when you invoke it. If vim isn't compiled to detect UTF-8 (or whatever, UTF-16, UTF-32) encodings, you'll see a lot of gibberish when you open a "text file" containing UTF-8 encoded characters.

Which terminal encodings are default on Linux, and which are most common

The oldest character encoding used in consoles like VT52 was ASCII.

That basic decision has carried over for many years. Most consoles use ASCII, as defined by ANSI, as their most basic character set. The next set of encodings (in the West) is the ISO-8859 family (parts 1 through 16), one for each language or language group. The most common is ISO-8859-1 (Western European languages, including English); the others are used in proportion to the corresponding languages.

Then, the most general list of world characters is Unicode, which, in Linux, is usually encoded in UTF-8.

That encoding is the most common for present-day terminals and programs on Linux.

From the most general to the most particular settings:

OS

The default in Debian since Etch (released on April 8th, 2007) has been UTF-8.

Note : Fresh Debian/Etch installation have UTF8 enabled by default.

And confirmed on the release notes:

The default encoding for new Debian GNU/Linux installations is UTF-8. A number of applications will also be set up to use UTF-8 by default.

What that means is that Debian (and Ubuntu, Mint, and many others) are UTF-8 capable by default.

locale

Which encoding (and country) is actually chosen with the command dpkg-reconfigure locales is left to the user's preference.

That sets the actual locale of the computer, as reported by the locale command.

All of the LC_* environment variables have specific effects on the corresponding country/language categories, as defined by the POSIX spec.

tty

But the above are just general settings. A particular terminal may (or may not) match them. In general, though, the usual encoding for most terminals today is UTF-8.

Whether a particular terminal (tty) is set to UTF-8 can be checked with:

$ stty -a | grep -o '.iutf8'
 iutf8

That is, the setting is on when no - is printed before iutf8; an output of -iutf8 means it is off.

terminal

But the terminal emulator (GUI window) inside which the tty is (usually) running also has its own locale setting. If the settings are sane, then

$ locale charmap
UTF-8

will probably give the correct answer.
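
A program can query the same information for itself. The Python sketch below is one way to do it on a Unix-like system, assuming the process inherits the usual LANG/LC_* environment; `current_charmap` is a hypothetical helper name.

```python
import locale
import sys

def current_charmap() -> str:
    """Return the codeset of the current locale."""
    try:
        # Adopt the locale from the environment (LANG, LC_ALL, LC_CTYPE).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed; keep the default.
        pass
    return locale.nl_langinfo(locale.CODESET)

print("charmap:", current_charmap())
# The encoding Python itself uses to convert str file names to bytes.
print("filesystem encoding:", sys.getfilesystemencoding())
```

The printed values depend on the environment the script runs in, so no particular output should be expected.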

But that is just a quick and very shallow look at all the i18n settings of Linux/Unix.

Takeaway: assuming that Linux is using UTF-8 is probably your best bet.