Unix Filenames – Understanding File Name Encoding

Tags: character encoding, filenames, locale, special characters

I have a hard time understanding how file name encoding works. On unix.SE
I find contradictory explanations.

File names are stored as characters

To quote another answer:
Several questions about file-system character encoding on linux

[…] as you mention in your question, a UNIX file name is just a sequence of
characters; the kernel knows nothing about the encoding, which is entirely a
user-space (i.e., application-level) concept.

If file names are stored as characters, there has to be some kind of encoding
involved, since finally the file name has to end up as a bit or byte sequence
on the disk. If the user can choose any encoding to map the characters to a
byte sequence that is fed to the kernel, it is possible to create any byte
sequence for a valid file name.

Assume the following: one user uses an arbitrary encoding X, which translates
the file name foo into the byte sequence α and saves it to disk. Another user
uses encoding Y. In this encoding α translates to /, which is not
allowed in a file name. However, for the first user the file name is valid.

I assume that this scenario cannot happen.

File names are stored as binary blobs

To quote another answer:
What charset encoding is used for filenames and paths on Linux?

As noted by others, there isn't really an answer to this: filenames and
paths do not have an encoding; the OS only deals with sequence of bytes.
Individual applications may choose to interpret them as being encoded in
some way, but this varies.

If the system does not deal with characters, how can particular characters
(e.g. / or NUL) be forbidden in file names? There is no notion of a /
without an encoding.

An explanation would be that a file system can store file names containing any
character and it's only the user programs that take an encoding into account
that would choke on file names containing invalid characters. That, in turn,
means that file systems and the kernel can, without any difficulty, handle
file names containing a /.

I also assume that this is wrong.

Where does the encoding take place and where is the restriction posed of not
allowing particular characters?

Best Answer

Short answer: the restrictions are imposed in the Unix/Linux/BSD kernel, in the namei() function. Encoding takes place in user-level programs like xterm, firefox or ls.

I think you're starting from incorrect premises. A file name in Unix is a string of bytes with arbitrary values. Two values, 0x00 (ASCII Nul) and 0x2f (ASCII '/'), are simply not allowed: not as part of a multi-byte character encoding, not as anything. A "byte" can hold a number representing a character (in ASCII and some other encodings), but a "character" can require more than one byte (for example, code points above 0x7f in the UTF-8 representation of Unicode).
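To make the "string of bytes" point concrete, here is a minimal sketch that tries to create three files in a temporary directory. It assumes a Linux system with an ordinary filesystem such as ext4 or tmpfs; the helper name try_create is made up for this illustration. Note that Python itself rejects the zero byte before the call reaches the kernel, because a C string ends at the first zero byte anyway.

```python
import os
import tempfile


def try_create(directory: bytes, name: bytes) -> str:
    """Try to create an empty file called `name` inside `directory`."""
    path = directory + b"/" + name
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except ValueError:
        # Raised by Python for an embedded 0x00 byte: the path could never
        # reach the kernel intact, since C strings stop at the first zero.
        return "rejected: embedded NUL byte"
    except OSError as exc:
        return f"failed: {exc.strerror}"
    os.close(fd)
    return "created"


with tempfile.TemporaryDirectory() as tmp:
    d = os.fsencode(tmp)
    results = {
        # Not valid UTF-8, but just bytes as far as the kernel is concerned.
        "invalid-utf8": try_create(d, b"\xff\xfe\x80"),
        # Contains a zero byte.
        "nul": try_create(d, b"a\x00b"),
        # 0x2f is read as a separator: "file b inside directory a".
        "slash": try_create(d, b"a/b"),
    }
    listing = os.listdir(d)  # bytes in, bytes out

for label, outcome in results.items():
    print(label, "->", outcome)
print(listing)
```

On a typical Linux filesystem the first name is created as-is, while the other two never become file names at all.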

These restrictions arise from file name printing conventions and the ASCII character set. The original Unixes used ASCII '/' (numerically 0x2f) valued bytes to separate the pieces of a partially or fully qualified path (for example, '/usr/bin/cat' has the pieces "usr", "bin" and "cat"). The original Unixes used ASCII Nul to terminate strings. Other than those two values, bytes in file names may assume any value. You can see an echo of this in the UTF-8 encoding of Unicode. ASCII characters, including '/', take only one byte in UTF-8. The UTF-8 encoding of a code point above 0x7f never contains a zero-valued byte or a 0x2f byte, because every byte of a multi-byte sequence has its high bit set; a zero byte appears only as the encoding of the Nul control character itself. UTF-8 was invented for Plan 9, The Pretender to the Throne of Unix.
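The UTF-8 property can be checked by brute force. The sketch below (function and variable names are my own) encodes every code point above 0x7f and confirms that no resulting byte falls in the ASCII range. For contrast, it also shows that UTF-16 does not have this property, which is why the question's "encoding Y" scenario is a real concern for some encodings.

```python
def utf8_multibyte_is_ascii_free() -> bool:
    """Return True if no code point above 0x7f yields a byte below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True


utf8_safe = utf8_multibyte_is_ascii_free()

# Counterexample in another encoding: U+2F00 in little-endian UTF-16
# consists of exactly the two forbidden byte values.
utf16_bytes = "\u2f00".encode("utf-16-le")
utf16_has_forbidden = 0x00 in utf16_bytes and 0x2F in utf16_bytes

# Even plain ASCII text picks up zero bytes in UTF-16.
foo_utf16 = "foo".encode("utf-16-le")

print(utf8_safe, utf16_bytes, foo_utf16)
```

So a UTF-8 encoded name can only contain a 0x2f byte if it really contains the character '/', whereas a UTF-16 encoded name cannot be handed to the kernel as a path at all.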

Older Unixes (and, it looks like, Linux) had a namei() function that just looks at paths a byte at a time, breaking them into pieces at 0x2f-valued bytes and stopping at a zero-valued byte. namei() is part of the Unix/Linux/BSD kernel, so that's where the exceptional byte values get enforced.
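As a rough illustration of that byte-at-a-time view, here is a simplified sketch of the splitting step. It is not the kernel's code: split_path is a name I made up, and real path lookup also handles ".", "..", symbolic links, mount points and permissions.

```python
from typing import List


def split_path(raw: bytes) -> List[bytes]:
    """Split a path into components the way the text describes:
    stop at the first zero byte, break at every 0x2f byte."""
    end = raw.find(b"\x00")
    if end != -1:
        raw = raw[:end]  # everything after the terminator is ignored
    # Empty pieces come from leading or repeated slashes; drop them.
    return [part for part in raw.split(b"/") if part]


print(split_path(b"/usr/bin/cat"))
print(split_path(b"/usr//bin/cat\x00ignored"))
print(split_path("/tmp/\u00e9".encode("utf-8")))
```

Nothing in this function knows or cares what characters the other bytes stand for, which is the whole point.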

Notice that so far, I've talked about byte values, not characters. namei() does not enforce any character semantics on the bytes. That's up to the user-level programs, like ls, which might sort file names based on byte values or on character values. xterm decides what pixels to light up for file names based on the character encoding. If you don't tell xterm you've got UTF-8 encoded filenames, you'll see a lot of gibberish when you list them in it. If vim isn't compiled to detect UTF-8 (or whatever, UTF-16, UTF-32) encodings, you'll see a lot of gibberish when you open a "text file" containing UTF-8 encoded characters.
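A small sketch of what "up to the user-level programs" means in practice: the same stored bytes turn into different characters depending on the encoding a program assumes, and the sort order changes depending on whether a program compares bytes or characters. The example names are made up, and the case-folding sort key is only a crude stand-in for real locale collation.

```python
raw = b"caf\xc3\xa9"  # the bytes as stored in the directory entry

as_utf8 = raw.decode("utf-8")      # what a UTF-8 terminal would show
as_latin1 = raw.decode("latin-1")  # what a Latin-1 terminal would show

names = ["Zebra", "apple", "\u00e9clair"]
raw_names = [n.encode("utf-8") for n in names]

# Sorting on raw byte values: uppercase ASCII first, non-ASCII last.
byte_order = sorted(raw_names)

# Sorting on characters, ignoring case (stand-in for locale collation).
char_order = sorted(raw_names, key=lambda b: b.decode("utf-8").casefold())

print(as_utf8, as_latin1)
print(byte_order)
print(char_order)
```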

1. Is it possible to configure a Linux filesystem to use a fixed character encoding to store file names, regardless of the LANG/LC_ALL environment?

No, this is not possible: as you mention in your question, a UNIX file name is just a sequence of bytes; the kernel knows nothing about the encoding, which is entirely a user-space (i.e., application-level) concept.

In other words, the kernel knows nothing about LANG/LC_*, so it cannot translate.

2. Is it possible to let different file names refer to the same file?

You can have multiple directory entries referring to the same file; you can create them with hard links or symbolic links.

Be aware, however, that file names that are not valid in the current encoding (e.g., your GBK character string when you're working in a UTF-8 locale) will display badly, if at all.
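The "display badly" effect is easy to reproduce without touching the filesystem. This sketch takes a made-up two-character name, encodes it as GBK, and shows what a program would get when decoding those bytes under each encoding. The last line uses Python's surrogateescape error handler, which is what os.fsdecode() relies on on Unix to carry undecodable bytes through unchanged.

```python
# Bytes a program running in a GBK locale would hand to the kernel.
raw_name = "\u4e2d\u6587".encode("gbk")


def display(name: bytes, encoding: str) -> str:
    """Decode a file name for display, marking undecodable bytes."""
    return name.decode(encoding, errors="replace")


shown_gbk = display(raw_name, "gbk")     # the intended characters
shown_utf8 = display(raw_name, "utf-8")  # replacement characters

# Undecodable bytes can still round-trip losslessly if the program
# smuggles them through as lone surrogates instead of replacing them.
roundtrip = raw_name.decode("utf-8", errors="surrogateescape").encode(
    "utf-8", errors="surrogateescape"
)

print(raw_name, shown_gbk, shown_utf8, roundtrip == raw_name)
```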

3. Is it possible to patch the kernel to translate the character encoding between the file system and the current environment?

You cannot patch the kernel to do this (see 1.), but you could, in theory, patch the C library (e.g., glibc) to perform this translation: always convert file names to UTF-8 when it calls the kernel, and convert them back to the current encoding when it reads a file name from the kernel.

A simpler approach could be to write an overlay filesystem with FUSE that just redirects any filesystem request to another location after converting the file name to/from UTF-8. Ideally you could mount this filesystem on ~/trans, so that when an access is made to ~/trans/a/GBK/encoded/path, the FUSE filesystem really accesses /a/UTF-8/encoded/path.
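The name translation at the heart of such an overlay could look roughly like the sketch below. It shows only the conversion step, not any FUSE plumbing; the function names and the choice of GBK as the user-side encoding are assumptions for illustration. Both functions raise an exception on names that are not valid in the source encoding, which is exactly the problem discussed next.

```python
def to_backend(path: bytes, user_encoding: str = "gbk") -> bytes:
    """Convert a path as seen by the user into the UTF-8 path on disk."""
    return path.decode(user_encoding).encode("utf-8")


def to_user(path: bytes, user_encoding: str = "gbk") -> bytes:
    """Convert a UTF-8 path from disk back into the user's encoding."""
    return path.decode("utf-8").encode(user_encoding)


requested = "/a/\u4e2d\u6587".encode("gbk")  # what the application asks for
on_disk = to_backend(requested)              # what the overlay would open

print(requested, on_disk, to_user(on_disk) == requested)
```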

However, the problem with these approaches is: what do you do with files that already exist on your filesystem and are not UTF-8 encoded? You cannot simply pass them through untranslated, because then you don't know how to convert them; you cannot mangle them by translating invalid character sequences to ?, because that could create conflicts...
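The conflict is easy to demonstrate. In this sketch, two made-up legacy names (Latin-1 encoded, so not valid UTF-8) differ on disk but become identical once invalid sequences are mangled to "?". The mangle function is my own illustration of the naive approach, not part of any library.

```python
def mangle(name: bytes) -> str:
    """Naive approach: decode as UTF-8, turning invalid bytes into '?'."""
    return name.decode("utf-8", errors="replace").replace("\ufffd", "?")


# Two distinct directory entries left over from a Latin-1 locale.
a = b"report-\xe9.txt"
b = b"report-\xe8.txt"

print(a != b)                  # distinct on disk
print(mangle(a), mangle(b))    # indistinguishable after mangling
```

After mangling, a request for "report-?.txt" could mean either file, so the overlay has no way to decide which one to open.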
