What charset encoding is used for filenames and paths on Linux?

character-encoding, filenames, locale

Does it depend on which file system I use? For example, ext2/ext3/ext4, but also what happens when I insert one of those "Joliet" CD-ROMs with ISO 9660? I've heard that POSIX contains some sort of spec for the charset encoding of filenames?

Essentially, what I wonder is: if I have a UTF-8 encoded filename, what processing/conversion do I need to do before I pass it to a file I/O API in Linux?

Best Answer

As noted by others, there isn't really an answer to this: filenames and paths do not have an encoding; the OS only deals with sequences of bytes. Individual applications may choose to interpret them as being encoded in some way, but this varies.
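To illustrate the "sequences of bytes" point, here is a minimal Python sketch. The helper name `create_and_list` is made up for this example, and it assumes a typical Linux filesystem (ext4, tmpfs, and the like) that stores names without validating them. It creates a file whose name is given as raw bytes and reads the directory back as raw bytes:

```python
import os
import tempfile
from typing import List


def create_and_list(raw_name: bytes) -> List[bytes]:
    """Create an empty file named by raw bytes; return the directory listing as bytes.

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_b = os.fsencode(tmp)              # work with a bytes path throughout
        path = os.path.join(tmp_b, raw_name)  # bytes + bytes, no decoding involved
        with open(path, "wb"):                # open() accepts a bytes path as-is
            pass
        return os.listdir(tmp_b)              # bytes in, bytes out
```

Under that assumption, `create_and_list("naïve.txt".encode("utf-8"))` and `create_and_list(b"caf\xe9.txt")` (Latin-1, not valid UTF-8) should each return exactly the bytes that were passed in. So for a UTF-8 encoded name, no conversion is needed before handing it to the file API.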

Specifically, GLib (used by GTK+ apps) assumes that all file names are UTF-8 encoded, regardless of the user's locale. This may be overridden with the environment variables G_FILENAME_ENCODING and G_BROKEN_FILENAMES.

On the other hand, Qt defaults to assuming that all file names are encoded in the current user's locale. An individual application may choose to override this assumption, though I do not know of any that do, and there is no external override switch.
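The practical consequence of these two strategies can be sketched in plain Python. This does not call GLib or Qt; the `interpret` helper is hypothetical and only mimics "always decode as UTF-8" versus "decode using the locale's encoding", with Latin-1 standing in for a non-UTF-8 locale:

```python
def interpret(raw: bytes, encoding: str) -> str:
    """Decode raw filename bytes for display; undecodable bytes become U+FFFD."""
    return raw.decode(encoding, errors="replace")


utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'

# "Always UTF-8" strategy: correct for UTF-8 names, lossy for Latin-1 names.
as_utf8 = (interpret(utf8_bytes, "utf-8"), interpret(latin1_bytes, "utf-8"))

# "Locale encoding" strategy in a Latin-1 locale: the reverse situation.
as_latin1 = (interpret(utf8_bytes, "latin-1"), interpret(latin1_bytes, "latin-1"))
```

Here `as_utf8` is `('café', 'caf\ufffd')` and `as_latin1` is `('cafÃ©', 'café')`: the same bytes on disk display differently depending on which assumption the application makes.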

Modern Linux distributions are typically set up so that all users use UTF-8 locales and paths on foreign filesystem mounts are translated to UTF-8, so this difference in strategies generally has no effect. However, if you really want to be safe, you cannot assume any structure about filenames beyond "NUL-terminated, '/'-delimited sequence of bytes".
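A defensive way to follow that advice, sketched in Python under the assumption of a POSIX system: keep paths as bytes internally, split only on `/`, and decode only for display. The helper names below are invented for this example; `os.fsencode` and `os.fsdecode` are the standard-library functions that convert between bytes and `str` paths.

```python
import os
from typing import List


def split_components(path: bytes) -> List[bytes]:
    """Split a path on b'/' only; every other byte value is left untouched."""
    return [part for part in path.split(b"/") if part]


def is_valid_component(name: bytes) -> bool:
    """A name only has to be non-empty and free of b'/' and NUL bytes."""
    return bool(name) and b"/" not in name and b"\x00" not in name


def display_name(raw: bytes) -> str:
    """Decode for display only; never feed this string back to the file API."""
    return raw.decode("utf-8", errors="replace")


def roundtrip(raw: bytes) -> bytes:
    """Bytes to str and back via os.fsdecode/os.fsencode."""
    return os.fsencode(os.fsdecode(raw))
```

On POSIX, `os.fsdecode` uses the `surrogateescape` error handler, so `roundtrip(b"caf\xe9")` should give back the original bytes even though they are not valid UTF-8. The `display_name` result, by contrast, is lossy and is meant for showing to a user only.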

(Also note: locale may vary by process. Two different processes run by the same user may be in different locales simply by having different environment variables set.)
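The per-process nature of the locale can be seen with a small sketch that starts child Python processes with different `LC_ALL` values. The `child_locale` helper is hypothetical, and the codeset a child reports depends on which locales are installed on the machine, so no particular output is promised:

```python
import os
import subprocess
import sys
from typing import Tuple

# Code run inside the child: report LC_ALL and the codeset the C library derives from it.
CHILD_CODE = (
    "import locale, os\n"
    "try:\n"
    "    locale.setlocale(locale.LC_CTYPE, '')\n"
    "    codeset = locale.nl_langinfo(locale.CODESET)\n"
    "except locale.Error:\n"
    "    codeset = 'unavailable'\n"
    "print(os.environ.get('LC_ALL', ''), codeset)\n"
)


def child_locale(lc_all: str) -> Tuple[str, str]:
    """Run a child with LC_ALL set to lc_all; return (LC_ALL seen, codeset reported)."""
    env = dict(os.environ, LC_ALL=lc_all)  # only the child's environment changes
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env, capture_output=True, text=True, check=True,
    )
    seen, codeset = result.stdout.strip().split(" ", 1)
    return seen, codeset
```

Calling `child_locale("C")` and, where that locale is installed, `child_locale("en_US.UTF-8")` from the same parent process will usually report different codesets, even though both children belong to the same user.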