How to specify character encoding for 7z

Tags: 7z, locale

Doing 7z x on an archive gives me

'20 ª.1 ¯® '$'\302\212''¨à®¢®£à ¤áª ï ã«.rtf'  IMG_6527.JPG
''$'\302\212''¨à®¢®£à ¤áª ï, ¨áâ.doc'          IMG_6532.JPG
''$'\302\204''®¯  ᮣ« è¥­¨¥(3).doc'           IMG_6542.JPG
''$'\302\204\302\212\302\217''.doc'        IMG_6543.JPG IMG_6526.JPG

Clearly some file names are stored in a different encoding, and 7z by default does not convert them to UTF-8. How do I tell 7z to do the conversion?

The only options I found for charset:

-scc{UTF-8|WIN|DOS}: set charset for for console input/output
-scs{UTF-8|UTF-16LE|UTF-16BE|WIN|DOS|{id}}: set charset for list files

None of WIN, DOS, or UTF-8 works. When I try to guess the charset with

7z -scsCP1251 l 26-08-2016_10-18-14.zip

7z gives a warning:

Unsupported charset: cp1251

unzip does this right (the Cyrillic characters are converted):

'20 к.1 по Кировоградская ул.rtf'  IMG_6532.JPG  'Доп  соглашение(3).doc'
26-08-2016_10-18-14.zip        IMG_6542.JPG  'Кировоградская, ист.doc'
IMG_6526.JPG               IMG_6543.JPG
IMG_6527.JPG               ДКП.doc

Supplementary information

  • p7zip Version:
    15.14.1 (locale=ru_RU.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs AMD Phenom(tm) II X4 960T Processor (100FA0),ASM)
    
  • hexdump of start of archive (od -tx1z -Ax):
    000000 50 4b 03 04 14 00 00 00 00 00 81 54 1a 49 7e 35  >PK.........T.I~5<
    000010 fa 34 00 ec 00 00 00 ec 00 00 07 00 17 00 84 8a  >.4..............<
    000020 8f 2e 64 6f 63 75 70 13 00 01 19 fd 45 54 d0 94  >..docup.....ET..<
    000030 d0 9a d0 9f 2e 64 6f 63 00 00 00 00 d0 cf 11 e0  >.....doc........<
    000040 a1 b1 1a e1 00 00 00 00 00 00 00 00 00 00 00 00  >................<
    000050 00 00 00 00 3e 00 03 00 fe ff 09 00 06 00 00 00  >....>...........<
    000060 00 00 00 00 00 00 00 00 01 00 00 00 71 00 00 00  >............q...<
    000070 00 00 00 00 00 10 00 00 73 00 00 00 01 00 00 00  >........s.......<
    000080 fe ff ff ff 00 00 00 00 70 00 00 00 ff ff ff ff  >........p.......<
    000090 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  >................<
    *
    000230 ff ff ff ff ff ff ff ff ff ff ff ff ec a5 c1 00  >................<
    000240 07 80 19 04 00 00 f0 12 bf 00 00 00 00 00 00 10  >................<
    000250 00 00 00 00 00 08 00 00 72 7b 00 00 0e 00 62 6a  >........r{....bj<
    000260 62 6a 2a 16 2a 16 00 00 00 00 00 00 00 00 00 00  >bj*.*...........<
    000270 00 00 00 00 00 00 00 00 19 04 16 00 34 8e 00 00  >............4...<
    000280 48 7c 00 00 48 7c 00 00 4b 2c 00 00 00 00 00 00  >H|..H|..K,......<
    000290 19 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
    0002a0 00 00 00 00 00 00 00 00 ff ff 0f 00 00 00 00 00  >................<
    0002b0 00 00 00 00 ff ff 0f 00 00 00 00 00 00 00 00 00  >................<
    0002c0 ff ff 0f 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
    0002d0 00 00 00 00 b7 00 00 00 00 00 3e 0e 00 00 00 00  >..........>.....<
    0002e0 00 00 3e 0e 00 00 a0 1b 00 00 00 00 00 00 a0 1b  >..>.............<
    0002f0 00 00 00 00 00 00 a0 1b 00 00 00 00 00 00 a0 1b  >................<
    000300 00 00 00 00 00 00 a0 1b 00 00 14 00 00 00 00 00  >................<
    000310 00 00 00 00 00 00 ff ff ff ff 00 00 00 00 b4 1b  >................<
    000320 00 00 00 00 00 00 b4 1b 00 00 00 00 00 00 b4 1b  >................<
    000330 00 00 38 00 00 00 ec 1b 00 00 84 00 00 00 70 1c  >..8...........p.<
    000340 00 00 34 00 00 00 b4 1b 00 00 00 00 00 00 b8 28  >..4............(<
    000350 00 00 e6 01 00 00 a4 1c 00 00 00 00 00 00 a4 1c  >................<
    000360 00 00 00 00 00 00 a4 1c 00 00 00 00 00 00 a4 1c  >................<
    000370 00 00 00 00 00 00 a4 1c 00 00 00 00 00 00 d8 1d  >................<
    000380 00 00 00 00 00 00 d8 1d 00 00 00 00 00 00 d8 1d  >................<
    000390 00 00 00 00 00 00 43 28 00 00 02 00 00 00 45 28  >......C(......E(<
    0003a0 00 00 00 00 00 00 45 28 00 00 00 00 00 00 45 28  >......E(......E(<
    *
    0003c0 00 00 00 00 00 00 45 28 00 00 00 00 00 00 9e 2a  >......E(.......*<
    0003d0 00 00 a2 02 00 00 40 2d 00 00 da 00 00 00 45 28  >......@-......E(<
    0003e0 00 00 2d 00 00 00 00 00 00 00 00 00 00 00 00 00  >..-.............<
    0003f0 00 00 00 00 00 00 a0 1b 00 00 00 00 00 00 d8 1d  >................<
    000400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
    000410 00 00 00 00 00 00 d8 1d 00 00 00 00 00 00 d8 1d  >................<
    000420
    

Best Answer

Depending on the encoding used to create the zip file, you might be able to prevent unwanted translations by temporarily setting the locale to "C":

LC_ALL=C 7z x "$archive"

(This helped for a zip created by IZArc on Win7, using two of your example filenames.)

However, for the archive in the question, the "filename" field contains the CP866 (DOS Cyrillic) encoding of "ДКП.doc" (84 8a 8f 2e 64 6f 63); in CP1251 the same name would be c4 ca cf 2e 64 6f 63. The "extra" field uses an Info-ZIP extension, the Unicode Path Extra Field with header ID 0x7075 (see section 4.6.9 of the Zip Specification v 6.3.4), to store the UTF-8 filename. unzip knows about this header and uses the UTF-8 name, ignoring the CP866 one.

7z (at least p7zip 15.14.1, the version in the question) doesn't do anything with this "extra field" and only uses the CP866 name. Depending on the current locale, it might create the file using that exact name (the raw bytes 84 8a 8f), or worse, treat the bytes as Unicode code points and expand them to UTF-8 (c2 84 c2 8a c2 8f).
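To make the two names concrete, here is a rough sketch that rebuilds the first 0x3c bytes of the archive from the hexdump in the question and pulls both names out of the local file header. It assumes od, dd, tr, mktemp and an iconv that knows CP866 are available; the helper u16 and all variable names are made up for illustration, and only the first entry is examined.

```shell
#!/bin/sh
# First 0x3c bytes of the archive, transcribed from the hexdump in the
# question: local file header, stored name, and extra field.
sample=$(mktemp)
{
    printf '\120\113\003\004\024\000\000\000\000\000\201\124\032\111\176\065'
    printf '\372\064\000\354\000\000\000\354\000\000\007\000\027\000\204\212'
    printf '\217\056\144\157\143\165\160\023\000\001\031\375\105\124\320\224'
    printf '\320\232\320\237\056\144\157\143\000\000\000\000'
} > "$sample"

# Hypothetical helper: little-endian 16-bit value at byte offset $2 of file $1.
u16() {
    set -- $(od -An -tu1 -j "$2" -N 2 "$1")
    echo $(( $1 + 256 * $2 ))
}

# In a local file header the name length is at offset 26, the extra field
# length at offset 28, and the name itself starts at offset 30.
name_len=$(u16 "$sample" 26)
extra_len=$(u16 "$sample" 28)
name_off=30
extra_off=$(( name_off + name_len ))
extra_end=$(( extra_off + extra_len ))

# The stored name as hex, then decoded under two different assumptions.
name_hex=$(od -An -tx1 -j "$name_off" -N "$name_len" "$sample" | tr -d ' \n')
as_cp866=$(dd if="$sample" bs=1 skip="$name_off" count="$name_len" 2>/dev/null |
    iconv -f CP866 -t UTF-8)
as_codepoints_hex=$(dd if="$sample" bs=1 skip="$name_off" count="$name_len" 2>/dev/null |
    iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

# Walk the extra field records (2-byte ID, 2-byte size, payload) looking
# for the Info-ZIP Unicode Path field, ID 0x7075 (28789 decimal).
utf8_name=
pos=$extra_off
while [ $(( pos + 4 )) -le "$extra_end" ]; do
    id=$(u16 "$sample" "$pos")
    size=$(u16 "$sample" $(( pos + 2 )))
    if [ "$id" -eq 28789 ]; then
        # Payload: 1 byte version, 4 bytes CRC-32 of the stored name, then
        # the UTF-8 name. This sample pads the name with NUL bytes, which
        # are stripped here.
        utf8_name=$(dd if="$sample" bs=1 skip=$(( pos + 9 )) count=$(( size - 5 )) 2>/dev/null |
            tr -d '\000')
    fi
    pos=$(( pos + 4 + size ))
done

echo "stored name bytes:    $name_hex"
echo "decoded as CP866:     $as_cp866"
echo "bytes as code points: $as_codepoints_hex"
echo "Unicode Path field:   $utf8_name"
rm -f "$sample"
```

The "bytes as code points" line is the same byte sequence that the shell prints as $'\302\204\302\212\302\217' in the 7z listing from the question, while the CP866 decoding and the Unicode Path field should agree with each other.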

One option is to use external utilities to change the zip first:

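#!/bin/bash
# Rename every entry in a copy of the archive to the name reported by zipinfo.

cp orig.zip renamed.zip

index=0
# zipinfo -1 prints one entry name per line, in archive order.
# IFS= and -r keep leading spaces and backslashes in names intact.
zipinfo -1 orig.zip | while IFS= read -r name ; do
        # ziptool addresses entries by zero-based index
        ziptool renamed.zip rename "$index" "$name"
        index=$((index+1))
done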

ziptool is part of libzip. zipinfo is distributed with Info-ZIP's UnZip, so if you have it installed you could just as well extract with unzip directly.
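If 7z has already extracted the files with mangled names, the "code points expanded to UTF-8" case described above can in principle be reversed after the fact. The following is only a sketch of that idea, not something 7z or unzip provides: repair_name and repair_dir are hypothetical helpers, the original bytes are assumed to be CP866, and iconv is assumed to support that charset. Names that are plain ASCII, or that already contain real Cyrillic, are left alone.

```shell
#!/bin/sh
# Map a mangled name back to the name it was meant to be.
# Assumption: every character is in U+0000..U+00FF and stands for one CP866 byte.
repair_name() {
    # Step 1: collapse each code point back to a single byte. This fails if
    # the name contains anything outside Latin-1 (for example real Cyrillic).
    raw=$(printf '%s' "$1" | iconv -f UTF-8 -t ISO-8859-1 2>/dev/null) || return 1
    # Step 2: decode those bytes as CP866 and emit UTF-8.
    printf '%s' "$raw" | iconv -f CP866 -t UTF-8
}

# Rename every entry directly inside directory $1 whose name can be repaired.
repair_dir() {
    dir=$1
    for path in "$dir"/*; do
        [ -e "$path" ] || continue
        old=${path##*/}
        new=$(repair_name "$old") || continue
        # Skip names that are empty or unchanged (for example pure ASCII).
        [ -n "$new" ] || continue
        [ "$new" != "$old" ] || continue
        # Never overwrite an existing file.
        [ -e "$dir/$new" ] && continue
        mv -- "$path" "$dir/$new"
    done
}
```

It could be called as repair_dir ./extracted, where ./extracted is a placeholder for wherever 7z put the files. A name that legitimately contains Latin-1 characters such as "é" would be rewritten incorrectly, so this is only reasonable when the whole directory came from one such archive.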