character-encoding – Encoding of Cyrillic filenames in zip files

There are a few questions here about non-ASCII letters in the names of files stored as streams inside zip files (Hebrew, Chinese, Japanese or Korean). However, none of the solutions provided helped me with a zip file with Cyrillic letters that came from a Windows machine.

The file itself has a Cyrillic name (Космос.zip). It is an archive with zero-length contents, just for the purpose of illustration.

unzip -l prints:

Archive:  Космос.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2017-05-03 18:19   ɫ���߼��/ict_inf.pdf
---------                     -------
        0                     1 file

The garbled directory name stands for the sequence of bytes C9 AB DF E8 AB DF BC AB DF.

I know (from the Gmail preview feature) that this should be:

Archive:  Космос.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2017-05-03 18:19   РосКосмос/ict_inf.pdf
---------                     -------
        0                     1 file

That is, we need to map C9 AB DF E8 AB DF BC AB DF to РосКосмос.

There are several commonly used 8-bit Cyrillic encodings (CP1251, CP866, ISO8859-5), but each of them would encode this word as a different sequence of bytes:

           Р  о  с  К  о  с  м  о  с
CP866:     90 AE E1 8A AE E1 AC AE E1
CP1251:    D0 EE F1 CA EE F1 EC EE F1
ISO8859-5: C0 DE E1 BA DE E1 DC DE E1

Clearly, none of the commonly used 8-bit Cyrillic encodings would decode the input name to the expected name. There is something more complicated at work here.
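The comparison in the table can be reproduced with a short Python sketch. It assumes that Python's standard codec names cp866, cp1251 and iso8859_5 correspond to the three encodings listed; the helper name hex_bytes is made up for illustration.

```python
# Encode the expected directory name in each candidate 8-bit Cyrillic
# encoding and compare the result with the bytes seen in the archive.
EXPECTED_NAME = "РосКосмос"
OBSERVED = "C9 AB DF E8 AB DF BC AB DF"  # byte sequence reported in the question


def hex_bytes(text, codec):
    """Return the encoded form of text as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in text.encode(codec))


CANDIDATES = {
    codec: hex_bytes(EXPECTED_NAME, codec)
    for codec in ("cp866", "cp1251", "iso8859_5")
}

if __name__ == "__main__":
    for codec, encoded in CANDIDATES.items():
        verdict = "match" if encoded == OBSERVED else "no match"
        print(f"{codec:10} {encoded}  ({verdict})")
```

None of the three candidates matches the observed bytes, which is consistent with the conclusion that a plain 8-bit code page is not the explanation.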

If only we knew how to decode the names, renaming the files after extraction would be easy with an appropriate find script (https://unix.stackexchange.com/a/252000/17649), e.g.

find -mindepth 1 -exec sh -c 'mv "$1" "$(echo "$1" | here-goes-the-decoding pipeline )"' sh {} \;

or the convmv utility.
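As a rough Python counterpart to the find one-liner, the sketch below renames every entry under a directory, deepest first. The decoding step is still the unknown part: fix_name is a hypothetical callback supplied by the caller, and rename_tree is a name invented here, not an existing tool.

```python
import os


def rename_tree(root, fix_name):
    """Rename every entry under root using fix_name(old_name) -> new_name.

    fix_name is a placeholder for whatever decoding turns out to be right.
    Returns the list of (old_name, new_name) pairs that were renamed.
    """
    renamed = []
    # topdown=False yields a directory only after its subdirectories, so a
    # directory is renamed only once everything inside it has been handled.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            new_name = fix_name(name)
            if new_name != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new_name))
                renamed.append((name, new_name))
    return renamed
```

For example, rename_tree("extracted", my_decoder) would apply a user-written my_decoder function to each file and directory name under extracted/.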

Best Answer

Your ZIP file, opened with a "recent" Info-ZIP, displays the right filenames:

unzip -l Russian-Космос.zip

Archive:  Russian-Космос.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2017-05-03 18:19   РосКосмос/ict_inf.pdf
---------                     -------
        0                     1 file

And unzip correctly creates the РосКосмос/ directory when unzipping.

UTF-8 support was added to Info-ZIP long ago. The executables on my Ubuntu are:

UnZip 6.00, 20 April 2009
Zip 3.0,  July 5th 2008

So your problem may be an ancient Info-ZIP version (or a version compiled without UTF-8 support).

In my version, strings /usr/bin/unzip | grep -A8 -B8 'UTF-8' yields, among other things:

ZIP64_SUPPORT (archives using Zip64 for large files supported)
LARGE_FILE_SUPPORT (large files over 2 GiB supported)
other
UTF-8
UNICODE_SUPPORT [wide-chars, char coding: %s] (handle UTF-8 paths)
USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)

which seems to be related to compile/build options.
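The archive can also be inspected from the other side. One thing worth checking is whether its entries carry the ZIP "UTF-8 name" flag (general-purpose bit 11, value 0x800). The sketch below uses Python's standard zipfile module. The function names are made up for illustration, and it is an assumption, not something established above, that the archive in the question marks its names this way rather than through some other mechanism.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: entry name is stored as UTF-8


def utf8_flag_report(zip_source):
    """Return (name, has_utf8_flag) for every entry in the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return [(info.filename, bool(info.flag_bits & UTF8_FLAG))
                for info in archive.infolist()]


def make_demo_archive():
    """Build a small in-memory archive resembling the one in the question."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("РосКосмос/ict_inf.pdf", b"")  # non-ASCII name
        archive.writestr("plain/ascii.txt", b"")        # ASCII-only name
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, flagged in utf8_flag_report(make_demo_archive()):
        print(f"{name!r}: UTF-8 flag {'set' if flagged else 'not set'}")
```

Passing the path of a real archive to utf8_flag_report instead of the demo buffer shows which of its entries declare UTF-8 names.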
