If I run:
scp me@example.com:/home/me/cömmön_file.jpg /home/me/
from my remote server I get:
scp: /home/me/cömmön_file.jpg: No such file or directory
If I replace the UTF-8 characters with a wildcard, though, it works:
scp me@example.com:/home/me/c?mm?n_file.jpg /home/me/
and/or
scp me@example.com:/home/me/c*mm*n_file.jpg /home/me/
If I use the AWS CLI on my remote machine the behavior also replicates.
Running other commands with the explicit name in them on my remote machine functions as I'd expect.
e.g.
ls -lha /home/me/cömmön_file.jpg
-rw-r--r--. 1 me me 1.1M Jan 15 21:58 /home/me/cömmön_file.jpg
I can rename the file as well with mv.
Is the problem with transmitting the file, or something underlying in my machine hosting the file?
The UTF-8 character causing the current issue is https://www.compart.com/en/unicode/U+0308 but I suspect other characters would also reproduce the issue. If I try to rename the file from ö to https://www.compart.com/en/unicode/U+00F6 my machine tells me the files are the same.
mv: ‘/home/me/cömmön_file.jpg’ and ‘/home/me/cömmön_file.jpg’ are the same file
The server hosting the file is:
NAME="CentOS Linux"
VERSION="7 (Core)"
and its locale is:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
The server requesting the file is:
NAME="Amazon Linux"
VERSION="2"
and its locale is:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Best Answer
Quick solution: do not type accented letters on your keyboard. Use tab completion instead (and have your SSH key set up so that tab completion also works over the network with scp, rsync, etc.) or fall back to wildcards, because what you are experiencing is the normal, intended behaviour.
It doesn't work because you did not type the same filename.
Seems crazy? That's UTF-8 for you.
Even more crazy: I can use my magical remote mind-reading psychic power to tell you that you have an Apple Mac.
More seriously: that's the crucial information you forgot to give when asking your question, but that you accidentally leaked when typing the question itself.
While copy-pasting the answer above:
Please pay close attention to how the letter 'ö' is coded: 6f cc 88. A literal 'o' followed by an extra UTF-8 code point. (In fact, on my terminal it doesn't even display as 'ö' but as 'o'.)
When I (= Linux user) type:
Again, look closely at the 'ö' symbol: c3 b6, a single, entirely different code point, with no extra literal ASCII.
Ultra-short explanation: Unicode normalization (composition vs decomposition).
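Longer explanation:
In Unicode, there are multiple ways to encode something that looks like 'ö': as a single precomposed code point (U+00F6), or as a plain 'o' followed by the combining diaeresis (U+0308).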
it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂
hum.
The rest of the planet uses composed characters whenever possible (because it's more compact and also because it uses the range of Unicode that is compatible with Latin-1, simplifying backward compatibility) and only resorts to combining characters for things that don't have their own code point (mostly less frequent languages).
Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).
Typing the keyboard letter that looks like 'ö' simply doesn't generate the same byte sequence on every computer; it depends on which machine you type the key on.
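To see the difference for yourself, here is a minimal Python sketch. It only uses the standard `unicodedata` module; the helper name `utf8_hex` is made up for this illustration. It prints the UTF-8 bytes of both spellings and shows that they only compare equal after normalization.

```python
import unicodedata

# Two ways to write something that renders as 'ö'.
decomposed = "o\u0308"   # 'o' followed by U+0308 COMBINING DIAERESIS
composed = "\u00f6"      # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs (hypothetical helper)."""
    return text.encode("utf-8").hex(" ")


print(utf8_hex(decomposed))    # 6f cc 88
print(utf8_hex(composed))      # c3 b6
print(decomposed == composed)  # False: the strings differ code point by code point

# After normalizing both to the same form, they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

NFC is the composed form and NFD the decomposed one; which of the two a given keyboard or system produces is exactly what differs between machines.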
Then another thing comes into play: most Unix systems tend to use file systems (like Linux's ext4) which are sensitive to case AND sensitive to Unicode encoding (where UTF-8 is supported). They preserve whether the text was composed or not. Thus they make a distinction between the UTF-8 byte sequences 6f cc 88 and c3 b6 even if they encode the same end result 'ö' (the same way they make a distinction between 'A' and 'a' even if it's the same Latin letter). So the 'ö' produced by your keyboard and the 'ö' on the server are not the same.
It happens that Stack Exchange just stores whatever Unicode encoding you throw at it as-is, leading to mythical answers such as the HTML regex parser ones. (Thus your Mac betrayed itself by the specific byte sequence recorded for 'ö'.)
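If wildcards and tab completion are not an option, one possible workaround is to look the file up by its normalized name and then reuse the exact name the file system returns. The following is only a sketch under assumptions: `find_by_normalized_name` is a hypothetical helper, it must run on the machine that hosts the file, and it assumes file names there are valid UTF-8.

```python
import os
import unicodedata


def find_by_normalized_name(directory, wanted):
    """Return the entries of directory whose NFC form equals the NFC form of wanted.

    The returned names are exactly as stored on disk, so they can be passed
    on to other tools without retyping the accented letters.
    """
    target = unicodedata.normalize("NFC", wanted)
    return [
        name
        for name in os.listdir(directory)
        if unicodedata.normalize("NFC", name) == target
    ]
```

For example, `find_by_normalized_name("/home/me", "c\u00f6mm\u00f6n_file.jpg")` would return the stored name whether it was saved in composed or decomposed form, and an empty list if no such file exists.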