Linux – UTF-8 filenames

filenames linux unicode

On Unix-based operating systems, are UTF-8 filenames permissible? If so, do I need to do anything special to write the file to disk?

Let me explain what I'm hoping to do. I'm writing an application that will transfer a file via FTP to a remote system, but the filename is set dynamically from some metadata, which could potentially be in UTF-8. I'm wondering if there's something I need to do to write the file to disk on Unix/Linux.

Also, as a follow-up, does anyone know what would happen if I uploaded a UTF-8 filename to a system that doesn't support UTF-8?

Best Answer

On Unix/Linux, a filename is a sequence of any bytes except for a slash or a NUL. A slash separates path components, and a NUL terminates a path name.
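
As a rough illustration of the rule above, the following sketch creates a file whose name is the UTF-8 encoding of "é.txt" and then dumps the bytes of the stored name. It assumes a Linux filesystem that stores names as raw bytes and a scratch directory from mktemp -d (widely available, but not in POSIX); the variable names are made up for the example.

```shell
# Scratch directory so the demo does not touch real files.
demo_dir=$(mktemp -d)

# "é" encoded in UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251).
utf8_name=$(printf '\303\251.txt')

# Creating the file needs nothing special: the name reaches the kernel as bytes.
: > "$demo_dir/$utf8_name"

# Dump the stored name byte by byte (hex), independent of the terminal encoding.
name_hex=$(cd "$demo_dir" && printf '%s' * | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$name_hex"    # on a byte-preserving filesystem: c3a92e747874

rm -r "$demo_dir"
```

A filesystem that normalizes Unicode names (some non-Linux systems do) may report different bytes than the ones written.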

So, you can use whatever encoding you want for filenames. Some applications may have trouble with some encodings if they are naïve about what characters may be in filenames - for example, poorly-written shell scripts often do not handle filenames with spaces.

Modern Unix/Linux environments handle UTF-8 encoded filenames just fine.
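
Because the kernel accepts any bytes, an application that builds names from metadata may want to check for itself that a name really is UTF-8 before writing or uploading it. A minimal sketch, assuming the iconv utility is installed and recognizes the codeset name UTF-8; is_valid_utf8 and the sample names are hypothetical:

```shell
# is_valid_utf8 NAME: succeed if NAME is a well-formed UTF-8 byte sequence.
# Converting UTF-8 to UTF-8 is used here purely as a validity check.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Two sample names: one in UTF-8, one in Latin-1 (a lone 0xE9 byte is not valid UTF-8).
good_name=$(printf 'r\303\251sum\303\251.txt')
bad_name=$(printf 'r\351sum\351.txt')

for name in "$good_name" "$bad_name"; do
  if is_valid_utf8 "$name"; then
    printf 'valid UTF-8:   '
  else
    printf 'invalid UTF-8: '
  fi
  printf '%s' "$name" | od -An -tx1
done
```

Either name can be written to disk; the check only tells you whether other UTF-8 aware tools are likely to display it correctly.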

Related Solutions

Linux Filesystems – Questions About Character Encoding

1. Is it possible to configure a Linux filesystem to use a fixed character encoding to store file names, regardless of the LANG/LC_ALL environment?

No, this is not possible: as you mention in your question, a UNIX file name is just a sequence of bytes; the kernel knows nothing about the encoding, which is entirely a user-space (i.e., application-level) concept.

In other words, the kernel knows nothing about LANG/LC_*, so it cannot translate.

2. Is it possible to let different file names refer to same file?

You can have multiple directory entries referring to the same file; you create them with hard links or symbolic links.

Be aware, however, that file names that are not valid in the current encoding (e.g., your GBK character string when you're working in a UTF-8 locale) will display badly, if at all.
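
A small sketch of the hard-link approach, giving one file both a UTF-8 name and a GBK name for the same character (U+4E2D). It assumes a Linux filesystem that accepts arbitrary bytes in names and supports hard links; the byte values are written with printf octal escapes so no converter is needed.

```shell
demo_dir=$(mktemp -d)   # scratch directory; mktemp -d is common but not POSIX

# The same character, U+4E2D, as bytes in two encodings.
utf8_name=$(printf '\344\270\255.txt')   # UTF-8: e4 b8 ad
gbk_name=$(printf '\326\320.txt')        # GBK:   d6 d0

printf 'hello\n' > "$demo_dir/$utf8_name"

# A hard link adds a second directory entry for the same inode.
ln "$demo_dir/$utf8_name" "$demo_dir/$gbk_name"

# Data appended through one name is visible through the other.
printf 'world\n' >> "$demo_dir/$gbk_name"
link_contents=$(cat "$demo_dir/$utf8_name")
printf '%s\n' "$link_contents"

rm -r "$demo_dir"
```

In a UTF-8 locale, ls will typically show the GBK entry as escaped or replacement characters, which is the display problem mentioned above.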

3. Is it possible to patch the kernel to translate character encoding between file-system and current environment?

You cannot patch the kernel to do this (see 1.), but you could, in theory, patch the C library (e.g., glibc) to perform this translation: always convert file names to UTF-8 when calling the kernel, and convert them back to the current encoding when reading a file name from the kernel.

A simpler approach could be to write an overlay filesystem with FUSE, that just redirects any filesystem request to another location after converting the file name to/from UTF-8. Ideally you could mount this filesystem in ~/trans, and when an access is made to ~/trans/a/GBK/encoded/path then the FUSE filesystem really accesses /a/UTF-8/encoded/path.

However, the problem with these approaches is: what do you do with files that already exist on your filesystem and are not UTF-8 encoded? You cannot simply pass them through untranslated, because then you don't know how to convert them; and you cannot mangle them by translating invalid character sequences to ?, because that could create conflicts.
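
For a one-off conversion done in user space, without patching glibc or writing a FUSE layer, the translation step can be sketched with iconv. This assumes you already know the source encoding of each name; to_utf8_name is a hypothetical helper, and the encoding names iconv accepts vary between implementations.

```shell
# to_utf8_name NAME FROM_ENCODING: print NAME re-encoded as UTF-8.
# Fails if iconv rejects NAME as invalid in FROM_ENCODING.
# Names ending in a newline are not handled by this sketch.
to_utf8_name() {
  printf '%s' "$1" | iconv -f "$2" -t UTF-8
}

# "café.txt" as Latin-1 bytes (é is the single byte 0xE9).
latin1_name=$(printf 'caf\351.txt')
utf8_name=$(to_utf8_name "$latin1_name" ISO-8859-1)

# Show the converted bytes; é becomes the two bytes c3 a9.
printf '%s' "$utf8_name" | od -An -tx1
```

Passing a name that is already UTF-8 through this with ISO-8859-1 would double-encode it, which is exactly the ambiguity described above: the bytes alone do not say which encoding they are in.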

Bash – Rename a lot of files with all types of characters, with POSIX portability

Actually, for i in *; do something; done treats every file name correctly, except that file names that begin with a . are excluded from the wildcard matching. To match all files (except . and ..) portably, match * .[!.]* ..?* and skip any nonexistent file resulting from a non-matching pattern being left intact.
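
A short sketch of the three-pattern loop, wrapped in a hypothetical count_entries function so the effect is easy to see:

```shell
# count_entries DIR: count every directory entry except . and .. using the
# three portable patterns. A pattern that matches nothing is left as a literal
# string, so results that do not exist are skipped.
count_entries() (
  cd "$1" || exit 1
  n=0
  for f in * .[!.]* ..?*; do
    [ -e "$f" ] || [ -L "$f" ] || continue   # -L keeps dangling symlinks
    n=$((n + 1))
  done
  printf '%s\n' "$n"
)

count_entries .
```

The function body is a subshell, so the cd does not affect the caller. The extra -L test is an addition to the text above: -e alone is false for a symbolic link whose target is missing.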

If you experienced problems, it's probably because you didn't quote $i properly later on. Always put double quotes around variable substitutions and command substitutions: "$foo", "$(cmd)" unless you intend field splitting and globbing to happen.

If you need to pass the file name to an external command (you don't, here), be careful that echo "$foo" does not always print $foo literally. A few shells perform backslash expansion, and a few values of $foo beginning with - will be treated as an option. The safe and POSIX-compliant way to print a string exactly is

printf '%s' "$foo"

or printf '%s\n' "$foo" to add a newline at the end. Another thing to watch out for is that command substitution removes trailing newlines; if you need to retain newlines, a possible trick is to append a non-newline character to the data, make sure the transformation retains this character, and finally truncate this character. For example:

mangled_file_name="$(printf '%sa' "$file_name" | tr -sc '[:alnum:]-+_.' '[_*]')"
mangled_file_name="${mangled_file_name%a}"
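
The two pitfalls above can be seen in isolation with a few lines. This is a sketch with made-up values; the sentinel character here is x rather than a, and any non-newline character would do.

```shell
# A value that echo may misinterpret: many shells treat -n as an option.
foo='-n'
printed=$(printf '%s\n' "$foo")   # the value is data for %s, never an option

# Build a name that ends in two newlines (using the sentinel trick itself).
file_name=$(printf 'report\n\nx'); file_name=${file_name%x}

# Plain command substitution drops the trailing newlines...
stripped=$(printf '%s' "$file_name")

# ...while appending a sentinel inside and removing it outside keeps them.
kept=$(printf '%sx' "$file_name"); kept=${kept%x}

printf 'printed:         %s\n' "$printed"
printf 'stripped length: %s\n' "${#stripped}"
printf 'kept length:     %s\n' "${#kept}"
```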

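To extract the MD5 checksum of the file, avoid having the file name in the md5sum output, since that would make it hard to strip. Pass the data on md5sum's standard input instead; GNU md5sum then prints a fixed " -" placeholder after the checksum, which is easy to remove.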

Note that the md5sum command is not in POSIX. A few unix variants have md5 or nothing at all. cksum is POSIX but collision-prone.
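
One way to cope with the differences between systems is a small wrapper that tries each tool in turn. This is a sketch: file_hash is a hypothetical name, and it assumes that a BSD-style md5 reading standard input prints only the digest.

```shell
# file_hash FILE: print a checksum of FILE's contents with no file name attached.
# Tries md5sum (GNU), then md5 (BSD/macOS), then POSIX cksum as a last resort.
file_hash() {
  if command -v md5sum >/dev/null 2>&1; then
    h=$(md5sum <"$1") || return 1
  elif command -v md5 >/dev/null 2>&1; then
    h=$(md5 <"$1") || return 1
  else
    h=$(cksum <"$1") || return 1
  fi
  # md5sum prints "HASH  -" and cksum prints "CRC SIZE" for standard input;
  # keep only the first space-separated field.
  printf '%s\n' "${h%% *}"
}
```

With the cksum fallback the result is only a CRC, so distinct files are much more likely to end up with the same value.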

See Grabbing the extension in a file name on how to get the file's extension.

Let's put it all together (untested). Everything here works under any POSIX shell; you could gain a little, but not much, from bash features.

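for old_name in * .[!.]* ..?*; do
  # Skip patterns that matched nothing and were left as literal strings.
  if ! [ -e "$old_name" ]; then continue; fi
  hash=$(md5sum <"$old_name")
  hash=${hash%% *}                    # GNU md5sum prints "HASH  -" for stdin
  case "$old_name" in
    *.*.gz|*.*.bz2)                   # double extension
      ext=".${old_name##*.}"          # last part, e.g. .gz
      tmp="${old_name%.*}"            # name without the last part
      ext=".${tmp##*.}$ext";;         # e.g. .tar.gz
    ?*.*) ext=".${old_name##*.}";;    # simple extension
    *) ext=;;                         # no extension
  esac
  mv -- "$old_name" "$hash$ext"
done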

Note that I did not consider the case where there is already a target file by the specified name. In particular, if you have existing files whose name looks like your adopted convention but where the checksum part doesn't match the file's contents and instead matches that of some other file with the same extension, what happens will depend on the relative lexicographic order of the file names.
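
If you do want to guard against an existing target, one possible sketch is a wrapper that refuses to overwrite. rename_no_clobber is a hypothetical helper, and there is still a race between the check and the mv, so it is only suitable when nothing else is modifying the directory.

```shell
# rename_no_clobber OLD NEW: rename OLD to NEW unless NEW already exists.
rename_no_clobber() {
  if [ "$1" = "$2" ]; then return 0; fi     # already has its target name
  if [ -e "$2" ] || [ -L "$2" ]; then
    printf 'not renaming %s: %s already exists\n' "$1" "$2" >&2
    return 1
  fi
  mv -- "$1" "$2"
}
```

In the loop above, this would replace the final mv line, called as rename_no_clobber "$old_name" "$hash$ext".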
