Unix Filenames – Understanding File Name Encoding

Tags: character encoding, filenames, locale, special characters

I have a hard time understanding how file name encoding works. On unix.SE
I find contradictory explanations.

File names are stored as characters

To quote another answer:
Several questions about file-system character encoding on linux

[…] as you mention in your question, a UNIX file name is just a sequence of
characters; the kernel knows nothing about the encoding, which is entirely a
user-space (i.e., application-level) concept.

If file names are stored as characters, there has to be some kind of encoding
involved, since ultimately the file name has to end up as a bit or byte sequence
on the disk. If the user can choose any encoding to map the characters to a
byte sequence that is fed to the kernel, it is possible to create any byte
sequence for a valid file name.

Assume the following: One user uses an arbitrary encoding X, which translates
the file name foo into the byte sequence α and saves it to disk. Another user
uses encoding Y. In this encoding, α translates to /, which is not
allowed in a file name. However, for the first user the file name is valid.

I assume that this scenario cannot happen.

File names are stored as binary blobs

To quote another answer:
What charset encoding is used for filenames and paths on Linux?

As noted by others, there isn't really an answer to this: filenames and
paths do not have an encoding; the OS only deals with sequence of bytes.
Individual applications may choose to interpret them as being encoded in
some way, but this varies.

If the system does not deal with characters, how can particular characters
(e.g. / or NUL) be forbidden in file names? There is no notion of a /
without an encoding.

One explanation would be that the file system can store file names containing
any character, and it is only the user programs that take an encoding into
account that would choke on file names containing invalid characters. That, in
turn, would mean that file systems and the kernel can, without any difficulty,
handle file names containing a /.

I also assume that this is wrong.

Where does the encoding take place, and where is the restriction on
particular characters imposed?

Best Answer

Short answer: the restrictions are imposed in the Unix/Linux/BSD kernel, in the namei() function. Encoding takes place in user-level programs like xterm, firefox or ls.

I think you're starting from incorrect premises. A file name in Unix is a string of bytes with arbitrary values. Two values, 0x00 (ASCII NUL) and 0x2f (ASCII '/'), are simply not allowed: not as part of a multi-byte character encoding, not as anything. A "byte" can contain a number representing a character (in ASCII and some other encodings), but a "character" can require more than one byte (for example, code points above 0x7f in the UTF-8 representation of Unicode).
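
To make the "not as part of a multi-byte character encoding" point concrete, here is a small Python sketch. The helper name forbidden_bytes is made up for illustration; it only encodes a name and reports which of the two forbidden byte values turn up in the result. It does not touch the file system.

```python
def forbidden_bytes(name, encoding):
    """Return the set of forbidden byte values (0x00, 0x2f) that occur
    in `name` once it is encoded with `encoding`."""
    return {b for b in name.encode(encoding) if b in (0x00, 0x2F)}


# Plain ASCII letters in UTF-8: nothing forbidden.
print(forbidden_bytes("foo", "utf-8"))

# The same letters in UTF-16: every second byte is zero.
print(forbidden_bytes("foo", "utf-16-le"))

# U+2F00 in UTF-16 is the byte pair 0x00 0x2f, so both forbidden values appear,
# although the character itself is neither NUL nor '/'.
print(forbidden_bytes("\u2f00", "utf-16-le"))
```

So a user who picked UTF-16 as "encoding X" could not hand most names to the kernel at all: the byte string would be cut off at the first zero byte or split at a 0x2f byte, whatever character those bytes were meant to be part of.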

These restrictions arise from file name printing conventions and the ASCII character set. The original Unixes used bytes with the value of ASCII '/' (numerically 0x2f) to separate the pieces of a partially or fully qualified path (for example, '/usr/bin/cat' has the pieces "usr", "bin" and "cat"), and they used ASCII NUL to terminate strings. Apart from those two values, bytes in file names may take any value. You can see an echo of this in the UTF-8 encoding of Unicode. ASCII characters, including '/', take only one byte in UTF-8, and that byte has the same value as in ASCII. The UTF-8 encoding of a code point above 0x7f uses only bytes with values of 0x80 or higher, so it never contains a zero-valued byte or a 0x2f byte; a zero byte appears only as the encoding of the NUL control character itself. UTF-8 was invented for Plan 9, The Pretender to the Throne of Unix.
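
The UTF-8 property mentioned above can be checked by brute force. The following Python sketch (the function name utf8_is_ascii_safe is just an illustrative choice) walks over every Unicode code point above 0x7f and confirms that none of them encodes to a byte below 0x80, which rules out both 0x00 and 0x2f.

```python
def utf8_is_ascii_safe():
    """Return True if no code point above 0x7f produces a UTF-8 byte
    below 0x80 (and therefore never a 0x00 or 0x2f byte)."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        if min(chr(cp).encode("utf-8")) < 0x80:
            return False
    return True


print(utf8_is_ascii_safe())

# '/' and NUL themselves keep their single-byte ASCII values.
print("/".encode("utf-8"), "\0".encode("utf-8"))
```

This is why a UTF-8 encoded file name can contain any character other than NUL and '/' without the kernel ever mistaking part of it for a separator or terminator.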

Older Unixes (and, it looks like, Linux) have a namei() function that just looks at a path one byte at a time, breaking it into pieces at 0x2f-valued bytes and stopping at a zero-valued byte (in current Linux kernels the path-walking code lives in fs/namei.c). namei() is part of the Unix/Linux/BSD kernel, so that is where the exceptional byte values get enforced.
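
As a rough illustration of that byte-at-a-time behaviour, here is a Python sketch. It is not kernel code (the real thing is C and does far more, such as looking up each piece in a directory); the function name split_path is invented for this example. It shows only the two rules described above: stop at a zero byte, split at 0x2f bytes, and treat every other byte as opaque.

```python
def split_path(raw):
    """Split a path given as bytes into its components:
    stop at the first zero byte, then break at every 0x2f byte."""
    end = raw.find(b"\x00")
    if end != -1:
        raw = raw[:end]  # everything after the NUL is never seen
    # Empty pieces come from a leading '/' or from repeated '/' bytes.
    return [piece for piece in raw.split(b"/") if piece]


print(split_path(b"/usr/bin/cat"))

# Bytes that are not valid in any common text encoding pass through untouched.
print(split_path(b"/tmp/\xff\xfe\x00ignored/x"))
```

Note that nothing in this logic decodes the bytes into characters, which is the sense in which the kernel "knows nothing about the encoding".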

Notice that so far I've talked about byte values, not characters. namei() does not enforce any character semantics on the bytes. That's up to the user-level programs, like ls, which might sort file names based on byte values or on character values. xterm decides what pixels to light up for file names based on the character encoding. If you don't tell xterm you've got UTF-8 encoded file names, you'll see a lot of gibberish when you list them. If vim isn't compiled to detect UTF-8 (or whatever: UTF-16, UTF-32) encodings, you'll see a lot of gibberish when you open a "text file" containing UTF-8 encoded characters.
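
The "gibberish" is easy to reproduce without a terminal. In this Python sketch the byte string raw stands in for a file name as the kernel stores it (the name itself is a made-up example), and the two decode calls play the role of two user-level programs that assume different encodings.

```python
# A file name as the kernel sees it: just bytes.
raw = b"caf\xc3\xa9.txt"

# A program that assumes UTF-8 sees one accented letter.
as_utf8 = raw.decode("utf-8")

# A program that assumes Latin-1 sees two unrelated characters instead.
as_latin1 = raw.decode("latin-1")

print(as_utf8, len(as_utf8))
print(as_latin1, len(as_latin1))
```

The bytes on disk are identical in both cases; only the interpretation in user space differs.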