Windows – Why do the file names look ‘normal’ in Linux but not remotely on Windows

character encoding, filenames, windows

While working with a colleague I found a strange issue that seems related to encoding. We're working with some images that have simple enough file names such as city.gif or wine.gif, but as one might expect things get more complicated when using special characters such as é, ë, à. We're also working with Dutch data that has these characters, e.g. café (pub). (We do not have control over the origin of the files.) Here's where the issues start to arise. The following file names are just an example. The issue also occurs for other characters with diacritics.

café-2.png
cafetaria.png
café.png

The first and last items should have an accented e in there (accent aigu, é). That's how it's shown in Linux (CentOS 6 & 7) in a terminal when running ls. But here comes Windows! (Using Windows 10, 64 bit.) When connected from Windows to our server through SSH and then calling ls, the list above looks like this:

café-2.png
cafetaria.png
caf▒.png

As you can hopefully see, the first line still has the accented e (é), but the third one doesn't. Instead, I see the character ▒, which is MEDIUM SHADE in Unicode (U+2592, 9618 decimal). This is strange in itself. However, when I connect through SFTP with FileZilla (still on Windows) I get to see this:

cafÃ©-2.png
cafetaria.png
café.png

So now things have turned around: in the first one, é has changed into the sequence Ã©, and in the third one everything's fine. I found here that this is most likely due to a Latin-1 <-> UTF-8 conversion that went wrong, if I got it right. But that can't be all that's going on, right?

Linux shows everything as we'd expect, while Windows shows seemingly inconsistent behaviour depending on the way we view the filename: SSH (PuTTY) or SFTP (FileZilla). Is there a way to 'normalise' these filenames, i.e. edit them, and make sure that they are all the same on every OS, or at least consistent? If so, how? UTF-8 is our encoding of choice.

Even though this may seem merely an aesthetic issue, it isn't. When trying to download things through SFTP in Windows from our Linux server, I cannot download the files that have the issue mentioned above. FileZilla will throw an error such as Can't download file cafÃ©-2.png: cafÃ©-2.png does not exist on the server. It seems to me that FileZilla reads the directory and the filename, interprets it in some encoding, and sends a GET request to the server with its interpretation, but that interpretation differs from the Linux file name, so the file is not found.

Ultimately it would be nice if there is a solution available, even though I am also interested in why this happens. Does it occur because the image files were possibly created on different Operating Systems? Does it occur because the Linux server interprets them wrong, or is Windows messing up? Hopefully there is a solution where we can just contact our sysadmin and ask them to turn on a switch in the server config, but I'm afraid it's not as easy as that.

Best Answer

But here comes Windows!

Windows has nothing whatsoever to do with this. You could reproduce this same exact behaviour with a local instance of (say) GNOME Terminal, with appropriately selected terminal encoding and appropriately configured locale for ls, without any Windows being in the picture at all.

The only thing that Windows does is clearly show what is going on here. Your Windows FTP program is taking the bytes in the filenames and displaying them as the relevant code points in code page 1252. Because code page 1252 is a single-byte encoding in which almost every byte above 0x1F is a printable glyph, this tells us exactly what the bytes in your filenames are.

Your second filename is largely uninformative, but the first and third are telling.

The first filename is the byte sequence 63 61 66 c3 a9 2d 32 2e 70 6e 67. In code page 1252 this is cafÃ©-2.png. It is also the UTF-8 encoding of café-2.png.
The third filename is the byte sequence 63 61 66 e9 2e 70 6e 67. In code page 1252 this is café.png. It is, however, not a valid UTF-8 encoding: e9 is the lead byte of a three-byte sequence, but the byte that follows it, 2e, is not a continuation byte, so the sequence is incomplete.
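You can check this reading of the bytes yourself. The following is a minimal Python sketch that decodes the two byte sequences quoted above, once as code page 1252 and once as UTF-8. The helper name try_utf8 is made up for this illustration.

```python
# Byte sequences as quoted in the answer above.
first = bytes.fromhex("63 61 66 c3 a9 2d 32 2e 70 6e 67")
third = bytes.fromhex("63 61 66 e9 2e 70 6e 67")


def try_utf8(raw: bytes):
    """Return the decoded string, or None if raw is not valid UTF-8.

    Hypothetical helper, named for this example only.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(first.decode("cp1252"))  # the name as FileZilla showed it
print(try_utf8(first))         # valid UTF-8, so a proper string comes back
print(third.decode("cp1252"))  # every byte maps to a glyph in cp1252
print(try_utf8(third))         # None: e9 is not followed by continuation bytes
```

The first name decodes cleanly both ways (with different results), while the third only decodes as code page 1252.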

So what is happening is that the things that are not using code page 1252 but are using UTF-8, namely your SSH session and your local terminal emulator, are handling the valid UTF-8 in the same way as one another but are handling the invalid UTF-8 in two different ways:

The one that is displaying the block graphic is very probably simply using that block graphic as the general replacement output character for invalid UTF-8 sequences.
The one that is displaying the letter é is falling back to Code Page 1252 when it encounters an invalid encoding.
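The two strategies can be sketched in a few lines of Python. This is only an illustration of the general idea, not the actual code of any terminal emulator. Note that Python's replacement character is U+FFFD, whereas the terminal in the question used ▒ (U+2592) for the same purpose. Both function names are invented for the example.

```python
def show_with_replacement(raw: bytes) -> str:
    # Strategy 1: decode as UTF-8 and substitute a placeholder
    # (U+FFFD in Python) for every byte that cannot be decoded.
    return raw.decode("utf-8", errors="replace")


def show_with_fallback(raw: bytes) -> str:
    # Strategy 2: try UTF-8 first; if the name is not valid UTF-8,
    # reinterpret the whole name as code page 1252 instead.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


valid = b"caf\xc3\xa9-2.png"   # UTF-8 encoded name
invalid = b"caf\xe9.png"       # code page 1252 encoded name

# Both strategies agree on the valid UTF-8 name ...
print(show_with_replacement(valid) == show_with_fallback(valid))
# ... but differ on the invalid one.
print(show_with_replacement(invalid))
print(show_with_fallback(invalid))
```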

Your underlying problem is a system that is somehow generating some filenames encoded as UTF-8 and other filenames encoded in Code Page 1252.
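Since UTF-8 is your encoding of choice, one way to normalise the existing names is to rename every entry that is not valid UTF-8, on the assumption that it is code page 1252. Below is a rough Python sketch of that idea, to be run on the Linux server. The function names and the cp1252 assumption are mine, not something the source confirms for every file, so do a dry run and inspect the result first.

```python
import os
from typing import List, Optional, Tuple


def utf8_name(raw: bytes, assumed: str = "cp1252") -> Optional[bytes]:
    """Return raw re-encoded as UTF-8, or None if raw is already valid UTF-8.

    `assumed` is a guess at the original encoding. Bytes that are
    undefined in that encoding raise UnicodeDecodeError.
    """
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError:
        return raw.decode(assumed).encode("utf-8")


def fix_directory(directory: bytes, dry_run: bool = True) -> List[Tuple[bytes, bytes]]:
    """List (old, new) name pairs; perform the renames unless dry_run is set."""
    renamed = []
    # Passing a bytes path makes os.listdir return the raw byte names.
    for old in os.listdir(directory):
        new = utf8_name(old)
        if new is None:
            continue  # already valid UTF-8, leave it alone
        target = os.path.join(directory, new)
        if os.path.lexists(target):
            continue  # never overwrite an existing entry
        if not dry_run:
            os.rename(os.path.join(directory, old), target)
        renamed.append((old, new))
    return renamed


# Dry run on the current directory: prints what would be renamed.
for old, new in fix_directory(b"."):
    print(old, "->", new)
```

Calling fix_directory(b".", dry_run=False) would perform the renames. Fixing whatever process creates the cp1252 names in the first place is still needed, otherwise new files will reintroduce the problem.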

1. Is it possible to config linux filesystem use fixed character encoding to store file names regardless of LANG/LC_ALL environment?

No, this is not possible: as you mention in your question, a UNIX file name is just a sequence of bytes; the kernel knows nothing about the encoding, which is entirely a user-space (i.e., application-level) concept.

In other words, the kernel knows nothing about LANG/LC_*, so it cannot translate.
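A small experiment illustrates this. The sketch below assumes a typical Linux filesystem (ext4, XFS, tmpfs and the like) that accepts arbitrary bytes in names. It creates one file with a UTF-8 name and one with a code page 1252 name in a temporary directory and reads the names back as bytes. The function name is invented for the example.

```python
import os
import tempfile


def roundtrip_names(names):
    """Create empty files with the given byte names; return the listing as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_b = os.fsencode(tmp)  # work with bytes paths throughout
        for name in names:
            with open(os.path.join(tmp_b, name), "wb"):
                pass
        return sorted(os.listdir(tmp_b))


names = [b"caf\xc3\xa9-2.png", b"caf\xe9.png"]
# The kernel hands back exactly the bytes it was given, whatever the locale.
print(roundtrip_names(names) == sorted(names))
```

Nothing here depends on LANG or LC_ALL. Those variables only influence how a program such as ls chooses to display the bytes.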

2. Is it possible to let different file names refer to same file?

You can have multiple directory entries referring to the same file; you can make that through hard links or symbolic links.

Be aware, however, that the file names that are not valid in the current encoding (e.g., your GBK character string when you're working in a UTF-8 locale) will display badly, if at all.
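As a sketch of what that looks like in practice, the following Python snippet gives one file a second name in a different encoding through a hard link, and a third name through a symbolic link. It assumes a Linux filesystem that supports both kinds of link. The file names are just the café example from the question, and alias_demo is a made-up name.

```python
import os
import tempfile


def alias_demo():
    """Return whether the hard link and the symlink refer to the original file."""
    with tempfile.TemporaryDirectory() as tmp:
        d = os.fsencode(tmp)
        original = os.path.join(d, "caf\u00e9.png".encode("utf-8"))
        hard = os.path.join(d, "caf\u00e9.png".encode("cp1252"))
        soft = os.path.join(d, b"cafe-link.png")

        with open(original, "wb") as f:
            f.write(b"data")

        os.link(original, hard)     # second directory entry, same inode
        os.symlink(original, soft)  # separate entry that points at the path

        return (os.path.samefile(original, hard),
                os.path.samefile(original, soft))


print(alias_demo())
```

The hard link shares the inode with the original, so either name can be removed without affecting the other. The symbolic link breaks if the original name goes away.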

3. Is it possible to patch the kernel to translate character encoding between file-system and current environment?

You cannot patch the kernel to do this (see 1.), but you could, in theory, patch the C library (e.g., glibc) to perform this translation, so that it always converts file names to UTF-8 when it calls the kernel and converts them back to the current encoding when it reads a file name from the kernel.

A simpler approach could be to write an overlay filesystem with FUSE, that just redirects any filesystem request to another location after converting the file name to/from UTF-8. Ideally you could mount this filesystem in ~/trans, and when an access is made to ~/trans/a/GBK/encoded/path then the FUSE filesystem really accesses /a/UTF-8/encoded/path.

However, the problem with these approaches is: what do you do with files that already exist on your filesystem and are not UTF-8 encoded? You cannot just simply pass them untranslated, because then you don't know how to convert them; you cannot mangle them by translating invalid character sequences to ? because that could create conflicts...
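The conflict is easy to demonstrate. In this Python sketch, two different byte names (é and è in code page 1252) are mangled by replacing every invalid UTF-8 sequence with ?, and both end up with the same name. The mangle function is hypothetical and only stands in for what such a translation layer might do.

```python
def mangle(raw: bytes) -> str:
    # Decode as UTF-8 and turn each undecodable byte into "?".
    return raw.decode("utf-8", errors="replace").replace("\ufffd", "?")


a = b"caf\xe9.png"  # "é" in code page 1252
b = b"caf\xe8.png"  # "è" in code page 1252

print(a == b)                  # the stored names differ
print(mangle(a), mangle(b))    # but both are shown under the same name
print(mangle(a) == mangle(b))
```

A translating layer that exposed both files under that one name would have no way to tell which underlying file a request refers to.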
