There are a few questions here about non-ASCII letters in the names of files stored as streams inside zip files (Hebrew, Chinese, Japanese or Korean). However, none of the solutions provided helped me with a zip file with Cyrillic letters that came from a Windows machine.
The file itself has a Cyrillic name (Космос.zip – downloadable link). It is an archive with zero-length contents, just for the purpose of illustration.
Running unzip -l on it prints:
Archive:  Космос.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2017-05-03 18:19   ɫ�����/ict_inf.pdf
---------                     -------
        0                     1 file
The ugly ɫ����� stands for the sequence of bytes C9 AB DF E8 AB DF BC AB DF.
I know (from the Gmail preview feature) that the listing should be:
Archive:  Космос.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2017-05-03 18:19   РосКосмос/ict_inf.pdf
---------                     -------
        0                     1 file
That is, we need to map C9 AB DF E8 AB DF BC AB DF to РосКосмос.
There are several commonly used 8-bit Cyrillic encodings (CP1251, CP866, ISO8859-5), but each of them would encode this word as a different sequence of bytes:
            Р  о  с  К  о  с  м  о  с
CP866:      90 AE E1 8A AE E1 AC AE E1
CP1251:     D0 EE F1 CA EE F1 EC EE F1
ISO8859-5:  C0 DE E1 BA DE E1 DC DE E1
Clearly, none of the commonly used 8-bit Cyrillic encodings maps the observed bytes directly to the expected name. Something more complicated is at work here.
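The comparison between the three encodings and the observed bytes can be reproduced with a short Python sketch. It uses only Python's built-in codecs (cp866, cp1251, iso8859_5); the variable names are illustrative and not part of any tool mentioned here.

```python
# Encode the expected directory name in each candidate 8-bit Cyrillic
# encoding and compare the result with the bytes that unzip produced.
word = "РосКосмос"
observed = bytes.fromhex("C9 AB DF E8 AB DF BC AB DF")

candidates = {}
for codec in ("cp866", "cp1251", "iso8859_5"):
    encoded = word.encode(codec)
    candidates[codec] = encoded
    # Print the bytes as upper-case hex and whether they equal the observed ones.
    print(f"{codec:10} {encoded.hex(' ').upper()}  match={encoded == observed}")
```

None of the three lines reports a match, which agrees with the conclusion that a single standard code page does not explain the garbled name.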
If only we knew how to decode the names, renaming the files after extraction would be easy with an appropriate find script (https://unix.stackexchange.com/a/252000/17649), e.g.
find -mindepth 1 -exec sh -c 'mv "$1" "$(echo "$1" | here-goes-the-decoding pipeline )"' sh {} \;
or with the convmv utility.
Best Answer
With a "recent" Info-ZIP, your ZIP file displays the right filenames:
And unzip correctly creates the РосКосмос/ directory when unzipping.
UTF-8 support was added to Info-ZIP long ago. Executables on my Ubuntu:
So your problem may be an ancient Info-ZIP version (or a version compiled without UTF-8 support).
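Independently of which unzip build is installed, it can help to check whether an archive marks its file names as UTF-8 at all. The sketch below uses Python's standard zipfile module and looks at general-purpose flag bit 11 (0x800), which the ZIP format uses for that purpose. The in-memory archive is only a stand-in for Космос.zip, and the helper name utf8_flags is made up for this example. Note that an archive may also carry Unicode names in an extra field instead of setting this flag, which this sketch does not inspect.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: file name is stored as UTF-8


def utf8_flags(archive):
    """Map each entry name to True/False: does the entry set the UTF-8 flag?"""
    with zipfile.ZipFile(archive) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}


# Demo on an in-memory archive (a stand-in, not the archive from the question).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("РосКосмос/ict_inf.pdf", b"")  # zero-length member
buf.seek(0)

flags = utf8_flags(buf)
print(flags)
```

To inspect a real file, pass its path to utf8_flags instead of the buffer. Entries reported as False have names in some legacy code page, and the extraction tool has to guess which one.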
In my version,
strings /usr/bin/unzip | grep -A8 -B8 'UTF-8'
yields, among other things, lines that seem to be related to compile/build options.
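If a newer unzip is not an option, the nine bytes quoted in the question are at least consistent with one specific mix-up: the name stored as CP866, read as if it were CP850, and then converted to ISO 8859-1. This is an inference from those bytes alone, not something the archive or unzip confirms, so treat the following as a hypothesis to test rather than a general fix. The function name recover_name is made up for illustration.

```python
def recover_name(raw: bytes) -> str:
    """Undo an assumed CP866 -> (misread as CP850) -> ISO 8859-1 conversion."""
    as_latin1 = raw.decode("latin-1")      # bytes -> characters the tool believed it had
    oem_bytes = as_latin1.encode("cp850")  # back to the original OEM code-page bytes
    return oem_bytes.decode("cp866")       # reinterpret those bytes as CP866


garbled = bytes.fromhex("C9 AB DF E8 AB DF BC AB DF")
print(recover_name(garbled))  # prints РосКосмос
```

The same chain could stand in for the here-goes-the-decoding placeholder in the find command from the question. For other inputs the cp850 step may raise UnicodeEncodeError, for example on bytes in the 0x80 to 0x9F range, which would suggest that the assumption does not hold for that name.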