Windows – Why do the file names look ‘normal’ in Linux but not remotely on Windows

character-encoding, filenames, windows

While working with a colleague I found a strange issue that seems related to encoding. We're working with some images that have simple enough file names such as city.gif or wine.gif, but as one might expect things get more complicated when using special characters such as é, ë, à. We're also working with Dutch data that has these characters, e.g. café (pub). (We do not have control over the origin of the files.) Here's where the issues start to arise. The following file names are just an example. The issue also occurs for other characters with diacritics.

café-2.png
cafetaria.png
café.png

The first and last item should have an accented e in there (accent aigu, é). That's how it's shown in Linux (CentOS 6 & 7) in a terminal when running ls. But here comes Windows! (Using Windows 10, 64 bit.) When connecting from Windows to our server through SSH and then calling ls, the list above looks like this:

café-2.png
cafetaria.png
caf▒.png

As you can hopefully see, the first line still has the accented e (é), but the third one doesn't. Instead, I see the character ▒, which is "medium shade" in Unicode (U+2592, 9618 decimal). This is strange in itself. However, when I connect through SFTP with FileZilla (still on Windows) I get to see this:

café-2.png
cafetaria.png
café.png

So now things have turned around: in the first one, é has changed into the sequence é, and in the third one everything's fine. I found here that this is most likely due to a Latin-1 <-> UTF-8 conversion that went wrong, if I got it right. But that can't be all that's going on, right?

Linux shows everything as we'd expect, while Windows shows seemingly inconsistent behaviour depending on the way we view the filename: SSH (PuTTY) or SFTP (FileZilla). Is there a way to 'normalise' these filenames, i.e. edit them, and make sure that they are all the same on every OS, or at least consistent? If so, how? UTF-8 is our encoding of choice.

Even though this may seem a merely aesthetic issue, it isn't. When trying to download things through SFTP in Windows from our Linux server, I cannot download the files that have the issue mentioned above. FileZilla will throw an error such as Can't download file café-2.png: café-2.png does not exist on the server. It seems to me that FileZilla reads the directory and the filename, interprets it in some encoding, and sends a GET request to the server with its interpretation, but that interpretation differs from the Linux file name, so the file is not found.

Ultimately it would be nice if there is a solution available, even though I am also interested in why this happens. Does it occur because the image files were possibly created on different Operating Systems? Does it occur because the Linux server interprets them wrong, or is Windows messing up? Hopefully there is a solution where we can just contact our sysadmin and ask them to turn on a switch in the server config, but I'm afraid it's not as easy as that.

Best Answer

"But here comes Windows!"

Windows has nothing whatsoever to do with this. You could reproduce this same exact behaviour with a local instance of (say) GNOME Terminal, with appropriately selected terminal encoding and appropriately configured locale for ls, without any Windows being in the picture at all.

The only thing that Windows does is clearly show what is going on here. Your Windows file transfer program is taking the bytes in the filenames and displaying them as the relevant code points in code page 1252. Because that is a single-byte encoding in which almost every value above 0x1F is a printable glyph, its display tells us exactly what the bytes in your filenames are.

Your second filename is largely uninformative, but the first and third are telling.

  • The first filename is the byte sequence 63 61 66 c3 a9 2d 32 2e 70 6e 67 — In code page 1252 this is café-2.png. It is also the UTF-8 encoding of café-2.png.
  • The third filename is the byte sequence 63 61 66 e9 2e 70 6e 67 — In code page 1252 this is café.png. It is, however, not a valid UTF-8 encoding. e9 begins an incomplete character encoding sequence.
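
The two readings above can be checked with a few lines of Python. This is only a minimal sketch: the byte strings are copied from the list above, and try_decode is a hypothetical helper name, not part of any tool mentioned in the question.

```python
# Byte sequences quoted in the answer, written as hex strings.
first = bytes.fromhex("63 61 66 c3 a9 2d 32 2e 70 6e 67")
third = bytes.fromhex("63 61 66 e9 2e 70 6e 67")


def try_decode(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or a short note if the bytes are invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"<invalid {encoding}: {exc.reason}>"


# Show each filename under both interpretations.
for label, raw in (("first", first), ("third", third)):
    for encoding in ("utf-8", "cp1252"):
        print(label, encoding, try_decode(raw, encoding))
```

Only one of the four combinations fails outright: the third filename read as UTF-8.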

So what is happening is that the things that are not using code page 1252 but are using UTF-8, namely your SSH session and your local terminal emulator, are handling the valid UTF-8 in the same way as one another but are handling the invalid UTF-8 in two different ways:

  • The one that is displaying the block graphic is very probably simply using that block graphic as the general replacement output character for invalid UTF-8 sequences.
  • The one that is displaying the letter é is falling back to Code Page 1252 when it encounters an invalid encoding.
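
Those two strategies might look roughly like this in Python. The function names are hypothetical, and whether either terminal really works this way internally is the conjecture made above, not something this sketch confirms.

```python
third = bytes.fromhex("63 61 66 e9 2e 70 6e 67")  # third filename from the list


def decode_with_replacement(raw: bytes, placeholder: str = "\u2592") -> str:
    """Strategy 1: show a fixed placeholder glyph for every invalid byte."""
    # errors="replace" inserts U+FFFD for undecodable input; swap it for the
    # block graphic so the result resembles the SSH session output.
    return raw.decode("utf-8", errors="replace").replace("\ufffd", placeholder)


def decode_with_fallback(raw: bytes) -> str:
    """Strategy 2: try UTF-8 first, fall back to code page 1252 on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252, hence errors="replace".
        return raw.decode("cp1252", errors="replace")


print(decode_with_replacement(third))
print(decode_with_fallback(third))
```

Valid UTF-8 names pass through both functions unchanged, which matches the observation that the two programs only disagree on the invalid name.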

Your underlying problem is a system that is somehow generating some filenames encoded as UTF-8 and other filenames encoded in Code Page 1252.
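
Since the question asks how to make the names consistent, one possible approach is to re-encode the stray names on the server itself. The following is a sketch under a loud assumption: every name that is not valid UTF-8 was written in code page 1252. The function names are hypothetical, and it is meant to be run on the Linux server, where directory listings can be read as raw bytes.

```python
import os
from typing import List, Optional, Tuple


def utf8_name_for(raw: bytes, legacy: str = "cp1252") -> Optional[bytes]:
    """Return a UTF-8 replacement name, or None if raw is already valid UTF-8."""
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError:
        # ASSUMPTION: anything that is not valid UTF-8 is cp1252.
        # This raises UnicodeDecodeError if the bytes are not valid cp1252
        # either, so nothing is guessed silently.
        return raw.decode(legacy).encode("utf-8")


def fix_directory(directory: bytes, dry_run: bool = True) -> List[Tuple[bytes, bytes]]:
    """Collect (old, new) name pairs; rename only when dry_run is False."""
    changes = []
    for raw in os.listdir(directory):  # bytes path in, bytes names out
        new = utf8_name_for(raw)
        if new is None:
            continue  # already valid UTF-8, leave it alone
        new_path = os.path.join(directory, new)
        if os.path.exists(new_path):
            continue  # never overwrite an existing file
        changes.append((raw, new))
        if not dry_run:
            os.rename(os.path.join(directory, raw), new_path)
    return changes
```

Calling fix_directory(b"/path/to/images") with the default dry_run=True only reports what would change (the path here is a placeholder). Reviewing that list before renaming is advisable, because the cp1252 assumption may not hold for every file.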
