Linux – UTF8 Character Makes File Inaccessible

Tags: character-encoding, filesystems, linux, unicode

If I run:

scp me@example.com:/home/me/cömmön_file.jpg /home/me/

from my remote server I get:

scp: /home/me/cömmön_file.jpg: No such file or directory

If I replace the UTF-8 characters with a wildcard, though, it works:

scp me@example.com:/home/me/c?mm?n_file.jpg /home/me/

and/or

scp me@example.com:/home/me/c*mm*n_file.jpg /home/me/

If I use the AWS CLI on my remote machine, I see the same behavior.

Running other commands with the explicit name in them on my remote machine functions as I'd expect.

e.g.

ls -lha /home/me/cömmön_file.jpg

-rw-r--r--. 1 me me 1.1M Jan 15 21:58 /home/me/cömmön_file.jpg

I can rename the file as well with mv.

Is the problem with transmitting the file, or something underlying in my machine hosting the file?

The character causing the current issue is U+0308 (https://www.compart.com/en/unicode/U+0308), but I suspect other characters would also reproduce the issue. If I try to rename the file from ö to U+00F6 (https://www.compart.com/en/unicode/U+00F6), my machine tells me the files are the same.

mv: ‘/home/me/cömmön_file.jpg’ and ‘/home/me/cömmön_file.jpg’ are the same file

The server hosting the file is:

NAME="CentOS Linux"
VERSION="7 (Core)"

and its locale is:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

The server requesting the file is:

NAME="Amazon Linux"
VERSION="2"

and its locale is:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Best Answer

Quick solution: do not type accented letters on your keyboard. Use tab completion instead (and have your SSH key set up so that tab completion also works over the network with scp, rsync, etc.), or fall back to wildcards. What you are experiencing is the normal, intended behaviour.

It doesn't work because you did not type the same filename.

Seems crazy? That's UTF-8 for you.

Even more crazy: I can use my magical remote mind-reading psychic power to tell you that you have an Apple Mac.

More seriously: that's the crucial information you forgot to give when asking your question, but that you accidentally leaked when typing the question itself.

Copy-pasting the command from your question:

# echo "scp me@example.com:/home/me/cömmön_file.jpg /home/me/" | hexdump -C
00000000  73 63 70 20 6d 65 40 65  78 61 6d 70 6c 65 2e 63  |scp me@example.c|
00000010  6f 6d 3a 2f 68 6f 6d 65  2f 6d 65 2f 63 6f cc 88  |om:/home/me/co..|
00000020  6d 6d 6f cc 88 6e 5f 66  69 6c 65 2e 6a 70 67 20  |mmo..n_file.jpg |
00000030  2f 68 6f 6d 65 2f 6d 65  2f 0a                    |/home/me/.|
0000003a

Please pay close attention to how the letter 'ö' is encoded: 6f cc 88. That is a literal 'o' (6f) followed by the UTF-8 encoding of an extra code point (cc 88). In fact, on my terminal it doesn't even display as 'ö' but as 'o'.

When I (a Linux user) type:

echo /home/me/cömmön_file.jpg | hexdump -C
00000000  2f 68 6f 6d 65 2f 6d 65  2f 63 c3 b6 6d 6d c3 b6  |/home/me/c..mm..|
00000010  6e 5f 66 69 6c 65 2e 6a  70 67 0a                 |n_file.jpg.|
0000001b

Again, look closely at the 'ö' symbol: c3 b6. That is the UTF-8 encoding of an entirely different code point, with no extra literal ASCII 'o' in front of it.
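To make the two hexdumps above easier to read, here is a minimal Python sketch that lists each code point of a string with its Unicode name and its UTF-8 bytes. The helper name `describe` is made up for this illustration; only the standard `unicodedata` module is used.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes as hex) for each code point."""
    rows = []
    for ch in text:
        utf8_hex = " ".join(f"{byte:02x}" for byte in ch.encode("utf-8"))
        rows.append((f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), utf8_hex))
    return rows


# The 'ö' as it appears in the question (decomposed) and as typed on Linux (composed).
for label, sample in (("decomposed", "o\u0308"), ("composed", "\u00f6")):
    print(label)
    for row in describe(sample):
        print("   ", row)
```

The decomposed string yields two rows (6f, then cc 88) and the composed one a single row (c3 b6), matching the dumps.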

Ultra short explanation: Unicode normalization (composition vs decomposition).

Longer explanation:

In Unicode, there are multiple ways to encode something that looks like 'ö'.

The first way is a composed character: there is a code point that is literally 'ö', inherited from Latin-1 (ISO/IEC 8859-1:1998). This is Unicode code point U+00F6 (encoded as c3 b6 in UTF-8).
The second way is a decomposed sequence: you first output the ASCII o, and then append a special code point that means 'please put an umlaut on the preceding letter'. This is Unicode code point U+0308 (encoded as cc 88 in UTF-8).
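As a rough illustration of composed vs decomposed forms, Python's standard `unicodedata.normalize` can convert between them (NFC composes, NFD decomposes). The file name is the one from the question; the helper `same_name` is a hypothetical name used only for this sketch.

```python
import unicodedata

composed = "c\u00f6mm\u00f6n_file.jpg"      # 'ö' as the single code point U+00F6
decomposed = "co\u0308mmo\u0308n_file.jpg"  # 'o' followed by combining U+0308


def same_name(a, b):
    """Compare two names after bringing both to the composed (NFC) form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                                # False: different code points
print(same_name(composed, decomposed))                       # True: equal once normalized
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: NFD splits 'ö' apart
```

A plain string (or byte) comparison sees two different names; only a comparison that normalizes first treats them as equal.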

it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂

hum.

The rest of the planet uses composed characters whenever possible (because it's more compact and also because it uses the range of Unicode that is compatible with Latin-1, simplifying backward compatibility) and only resorts to combining characters for things that don't have their own code point (mostly less frequent languages).

Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).

Typing the keyboard letter that looks like 'ö' simply doesn't generate the same binary sequence on every computer; it depends on which machine you type it on.

Then another thing comes into play: most Unix systems tend to use file systems (like Linux's ext4) that are sensitive to case AND sensitive to Unicode encoding (where UTF-8 is supported). They preserve whether the text was composed or not, so they distinguish between the UTF-8 byte sequences 6f cc 88 and c3 b6 even though both encode the same end result 'ö' (the same way they distinguish between 'A' and 'a' even though it's the same Latin letter). So the 'ö' produced by your keyboard and the 'ö' on the server are not the same.
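If you need to locate such a file without guessing which form was stored, one option is to compare names after normalization instead of byte for byte. The following is only a sketch: `find_on_disk` and `demo` are hypothetical names, and the demo creates its own temporary directory rather than touching /home/me.

```python
import os
import tempfile
import unicodedata


def find_on_disk(directory, wanted):
    """Return the names in `directory` that equal `wanted` once both are NFC-normalized."""
    target = unicodedata.normalize("NFC", wanted)
    return [
        name
        for name in os.listdir(directory)
        if unicodedata.normalize("NFC", name) == target
    ]


def demo():
    """Store a decomposed name, then look it up using the composed spelling."""
    with tempfile.TemporaryDirectory() as tmp:
        stored = "co\u0308mmo\u0308n_file.jpg"  # decomposed form, as in the question
        open(os.path.join(tmp, stored), "w").close()
        typed = "c\u00f6mm\u00f6n_file.jpg"     # composed form, as typed on Linux
        return find_on_disk(tmp, typed)


# Show the matching name as raw bytes so the stored form is visible.
print([name.encode("utf-8") for name in demo()])
```

The name returned is the one actually stored on disk, so it can be passed to other tools unchanged. This is essentially what the wildcard workaround does: it lets the remote side supply the real bytes.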

It happens that Stack Exchange just stores whatever Unicode encoding you throw at it as-is, leading to mythical answers such as the HTML regex parser ones. (That is how your Mac betrayed itself: through the specific byte sequence it recorded for 'ö'.)

Related Solutions

How to Filter Invalid UTF-8 Characters – Command Line Techniques

If you want to use grep, you can do:

grep -axv '.*' file

in UTF-8 locales to get the lines that contain at least one invalid UTF-8 sequence (this works with GNU grep at least).
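If grep is not available, the same idea can be sketched in Python by attempting a strict UTF-8 decode of each line. This is an approximation of the grep command above, not a drop-in replacement; the function name `invalid_utf8_lines` and the sample bytes are made up for the example.

```python
def invalid_utf8_lines(data):
    """Yield (line_number, raw_line) for every line of `data` (bytes) that is not valid UTF-8."""
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict decoding raises on any invalid sequence
        except UnicodeDecodeError:
            yield number, line


# Line 2 contains the byte 0xff, which can never appear in valid UTF-8.
sample = b"valid \xc3\xb6\nbroken \xff\nplain ascii\n"
print(list(invalid_utf8_lines(sample)))  # [(2, b'broken \xff')]
```

For a real file, read it in binary mode (`open(path, "rb").read()`) so Python does not try to decode it first.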

How to convert unknown-8bit file to utf8

There is no reliable way to convert from an unknown encoding to a known one.

In your case, if you know the original text is in Farsi / Persian, maybe you can identify a number of possible encodings, and iterate over those until you see the output you expect.

Based on quick googling, there is no standard, stable converter for the legacy Iran System encoding, and the only remaining popular alternative is Windows codepage 1256. I have included MacArabic here mainly for illustrative purposes (though maybe it would even be a feasible alternative for Farsi, too?)

for encoding in cp1256 macarabic; do
    if iconv -f "$encoding" -t utf-8 inputfile >outputfile."$encoding"; then
        echo "$encoding: possible"
    else
        echo "$encoding: skipped"
        rm outputfile."$encoding"
    fi
done

(My version of iconv doesn't actually support MacArabic, but maybe you will have more luck; or you can try a different conversion tool.)
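The same trial-and-error loop can be sketched in Python, which ships its own codecs. The function name `try_encodings` is hypothetical, and the codec names `cp1256` and `mac_arabic` are assumptions about what your Python build provides, so a missing codec is reported as `None` rather than treated as an error.

```python
def try_encodings(data, candidates=("cp1256", "mac_arabic")):
    """Decode `data` (bytes) with each candidate codec; map codec name to text or None."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except LookupError:
            results[name] = None  # this Python build does not know the codec
        except UnicodeDecodeError:
            results[name] = None  # the bytes are not valid in this encoding
    return results


# Round-trip a short Persian/Arabic word through cp1256 as a sanity check.
raw = "\u0633\u0644\u0627\u0645".encode("cp1256")
for codec, text in try_encodings(raw).items():
    print(codec, repr(text))
```

A successful decode only means the bytes are legal in that encoding, not that the encoding is the right one, so you still have to read the results and judge which one makes sense.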

Examine the resulting output files; see if one of them seems to make sense.

If you know what the output should look like, you can also look up individual mappings for bytes in the file. If the first byte is 0x94 and you know it should display as ﭖ you have basically established that the encoding is Iran System. Maybe look up a few more bytes to verify this conclusion. The Wikipedia page for this encoding has a table of all the characters. Obviously, this is painstaking, slow, and error prone, especially if there are many candidate encodings to choose from.

For some encodings, you can find a list e.g. at https://tripleee.github.io/8bit/ -- for others, maybe you just have to look at the corresponding Wikipedia coding tables.
