character-encoding – Encoding of Cyrillic filenames in zip files

character-encoding zip

There are a few questions here about non-ASCII letters in the names of files stored as streams inside zip files (Hebrew, Chinese, Japanese or Korean). However, none of the solutions provided helped me with a zip file with Cyrillic letters that came from a Windows machine.

The file itself has a Cyrillic name (Космос.zip – downloadable link). It is an archive with zero-length contents, just for the purpose of illustration.

unzip -l prints:

Archive:  Космос.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2017-05-03 18:19   ɫ���߼��/ict_inf.pdf
---------                     -------
        0                     1 file

The ugly ɫ���߼�� stands for the sequence of bytes C9 AB DF E8 AB DF BC AB DF.
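As a quick sanity check of that byte sequence, here is a minimal Python sketch (Python is my choice for illustration, it is not part of the original workflow). It assumes the listing is simply those bytes shown as if they were UTF-8, with invalid sequences replaced by U+FFFD. The exact number of replacement characters depends on the decoder, so it may differ from what unzip or a terminal prints.

```python
# The raw name bytes quoted above.
raw = bytes.fromhex("C9 AB DF E8 AB DF BC AB DF")

# Interpret them as UTF-8, replacing whatever is not valid UTF-8.
shown = raw.decode("utf-8", errors="replace")

# Print each resulting character with its code point.
for ch in shown:
    print(f"U+{ord(ch):04X} {ch!r}")
```

Only C9 AB and DF BC happen to be valid UTF-8 sequences, which is why just two recognisable characters survive in the listing.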

I know (from the Gmail preview feature) that this should be

Archive:  Космос.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2017-05-03 18:19   РосКосмос/ict_inf.pdf
---------                     -------
        0                     1 file

That is, we need to map C9 AB DF E8 AB DF BC AB DF to РосКосмос.

There are several commonly used 8-bit Cyrillic encodings (CP1251, CP866, ISO8859-5), but each of them would encode this word as a different sequence of bytes:

           Р  о  с  К  о  с  м  о  с
CP866:     90 AE E1 8A AE E1 AC AE E1
CP1251:    D0 EE F1 CA EE F1 EC EE F1
ISO8859-5: C0 DE E1 BA DE E1 DC DE E1

Clearly, none of the commonly used 8-bit Cyrillic encodings would turn the stored bytes into the expected name. Something more complicated is at work here.
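The comparison can be reproduced with a short Python sketch. The codec names (cp866, cp1251, iso8859_5) are Python's spellings of the three encodings, and the helper name hex_bytes is made up for this illustration.

```python
WORD = "РосКосмос"
OBSERVED = "C9 AB DF E8 AB DF BC AB DF"  # bytes seen in the archive listing
ENCODINGS = ["cp866", "cp1251", "iso8859_5"]

def hex_bytes(text, encoding):
    """Encode text and return its bytes as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

for enc in ENCODINGS:
    encoded = hex_bytes(WORD, enc)
    print(f"{enc:10} {encoded}  matches observed: {encoded == OBSERVED}")
```

Each line should report that the encoding does not match the observed bytes, which is the point made above.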

If only we knew how to decode the names, renaming the files after extraction would be easy with an appropriate find script (https://unix.stackexchange.com/a/252000/17649), e.g.

find -mindepth 1 -exec sh -c 'mv "$1" "$(echo "$1" | here-goes-the-decoding pipeline )"' sh {} \;

or the convmv utility.

Best Answer

With a "recent" Info-ZIP, your ZIP file lists the right filenames:

unzip -l Russian-Космос.zip 
Archive:  Russian-Космос.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2017-05-03 18:19   РосКосмос/ict_inf.pdf
---------                     -------
        0                     1 file

And unzip correctly creates the РосКосмос/ directory when unzipping.

UTF-8 support was added to Info-ZIP long ago. The executables on my Ubuntu:

UnZip 6.00, 20 April 2009
Zip 3.0,  July 5th 2008

So your problem may be an ancient Info-ZIP version (or a version compiled without UTF-8 support).
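If it helps to look at the archive side rather than the tool side, the sketch below uses Python's zipfile module (a different implementation from Info-ZIP, used here only for illustration) to report whether each entry declares its name as UTF-8 via general-purpose flag bit 11 (0x800). The function names and the in-memory demo archive are made up for this example. An archive may also carry a Unicode name by other means, which this sketch does not inspect, so a False here does not by itself prove the name is unrecoverable.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: "file name is UTF-8"

def list_names(zip_bytes):
    """Return (name, declares_utf8) for every entry in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [(info.filename, bool(info.flag_bits & UTF8_FLAG))
                for info in zf.infolist()]

def make_demo_archive():
    """Build a tiny in-memory archive with one zero-length Cyrillic entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("РосКосмос/ict_inf.pdf", b"")
    return buf.getvalue()

print(list_names(make_demo_archive()))
```

To try it on a real file, read the file in binary mode and pass its bytes to list_names instead of the demo archive.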

In my version, strings /usr/bin/unzip | grep -A8 -B8 'UTF-8' yields, among other things:

ZIP64_SUPPORT (archives using Zip64 for large files supported)
LARGE_FILE_SUPPORT (large files over 2 GiB supported)
other
UTF-8
UNICODE_SUPPORT [wide-chars, char coding: %s] (handle UTF-8 paths)
USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)

These strings appear to correspond to compile/build options.