I have reformulated your questions a bit, for reasons that should
appear evident when you read them in sequence.
1. Is it possible to configure a Linux filesystem to use a fixed character encoding for storing file names, regardless of the LANG/LC_ALL environment variables?
No, this is not possible: as you mention in your question, a UNIX file name is just a sequence of bytes; the kernel knows nothing about the encoding, which is entirely a user-space (i.e., application-level) concept.
In other words, the kernel knows nothing about LANG/LC_*, so it cannot translate.
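You can see this for yourself by creating a file whose name contains bytes that are invalid in UTF-8; the kernel stores them verbatim no matter what locale is active (the name below is just an arbitrary example):

```shell
cd "$(mktemp -d)"
# The kernel accepts any bytes in a name except '/' and NUL,
# regardless of the current locale; the byte 0xFF is invalid UTF-8.
name=$(printf 'bad\377name')
touch "$name"
ls -b     # GNU ls escapes the raw byte, e.g. bad\377name
```

The file exists and is accessible by its exact byte sequence, even though no locale can display that name cleanly.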
2. Is it possible to have different file names refer to the same file?
You can have multiple directory entries referring to the same file; you can create these with hard links or symbolic links.
Be aware, however, that file names that are not valid in the current encoding (e.g., your GBK character string when you're working in a UTF-8 locale) will display badly, if at all.
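As a concrete illustration (file names here are arbitrary), both kinds of link are created with ln:

```shell
cd "$(mktemp -d)"
printf 'hello\n' > original.txt
ln original.txt hardlink.txt      # hard link: a second directory entry for the same inode
ln -s original.txt symlink.txt    # symbolic link: a new file that points at the name
cat hardlink.txt symlink.txt      # both names read the same content
```

A hard link is indistinguishable from the original name; a symbolic link breaks if the target name is removed.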
3. Is it possible to patch the kernel to translate character encodings between the filesystem and the current environment?
You cannot patch the kernel to do this (see 1.), but you could, in theory, patch the C library (e.g., glibc) to perform this translation: always convert file names to UTF-8 when it calls the kernel, and convert them back to the current encoding when it reads a file name from the kernel.
A simpler approach would be to write an overlay filesystem with FUSE that just redirects any filesystem request to another location after converting the file name to/from UTF-8. Ideally you could mount this filesystem in ~/trans, so that when an access is made to ~/trans/a/GBK/encoded/path, the FUSE filesystem really accesses /a/UTF-8/encoded/path.
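The core of such an overlay is just a byte-level conversion of each path it receives, which iconv can already perform; a minimal sketch, assuming your iconv knows GBK (the string below, "你好" in GBK bytes, is only an example):

```shell
# Map a GBK-encoded name to its UTF-8 equivalent, as the overlay
# would do for every path passing through it.
gbk_name=$(printf '\304\343\272\303')                  # "你好" in GBK bytes
utf8_name=$(printf '%s' "$gbk_name" | iconv -f GBK -t UTF-8)
printf '%s' "$utf8_name" | od -An -tx1                 # the UTF-8 bytes of the same string
```

The overlay would apply this mapping in one direction on lookups and in the other when returning directory listings.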
However, the problem with these approaches is: what do you do with files that already exist on your filesystem and are not UTF-8 encoded? You cannot simply pass them through untranslated, because then you don't know how to convert them; you cannot mangle them by translating invalid character sequences to ?, because that could create conflicts...
zipinfo -1 zip.zip '*.doc'
works for me, displaying all files in sub-directories. I think you are forgetting the quotes around the *.doc. Without the quotes, *.doc expands to all .doc files in the current directory, and that expansion is passed to zipinfo as the search pattern. So if you have an unzipped version of the archive present in the local directory, the command will only show top-level .doc files.
With quotes, the argument is protected from the shell, so the wildcard actually makes it to zipinfo successfully.
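The difference is easy to demonstrate with echo standing in for zipinfo (the file names here are arbitrary):

```shell
cd "$(mktemp -d)"
touch a.doc b.doc
echo *.doc       # unquoted: the shell expands the glob first; prints: a.doc b.doc
echo '*.doc'     # quoted: the literal pattern reaches the command; prints: *.doc
```

Whatever the shell expands the glob to is all the command ever sees, which is why the quoted form is required when the pattern is meant for zipinfo itself.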
Best Answer
It sounds like the filenames are encoded in one of Windows' proprietary code pages (CP862, CP1255, etc.).
Is there another decompression utility that will decompress my files with the correct names?

I'm not aware of a zip utility that supports these code pages natively. 7z has some understanding of encodings, but I believe it has to be an encoding your system knows about more generally (you pick it by setting the LANG environment variable), and Windows codepages likely aren't among those.

unzip -UU should work from the command line to create files with the correct bytes in their names (by disabling all Unicode support). That is probably the effect you got from GNOME's tool already. The encoding won't be right either way, but we can fix that below.

Is there something wrong with the way the file was compressed, or is it just an incompatibility of ZIP implementations? Or even a misfeature/bug of the Linux ZIP utilities?

The file you've been given was not created portably. That's not necessarily wrong for internal use where the encoding is fixed and known in advance, although the format specification says that names are supposed to be either UTF-8 or cp437, and yours are neither. Even between Windows machines, using different codepages doesn't work out well, but non-Windows machines have no concept of those code pages to begin with. Most tools UTF-8-encode their filenames (which still isn't always enough to avoid problems).
What can I do to get the correct filenames after having decompressed using the garbled ones?

If you can identify the encoding of the filenames, you can convert the bytes in the existing names into UTF-8 and move the existing files to the right names. The convmv tool essentially wraps that process up into a single command:

convmv -f cp862 -t utf8 -r .

will try to convert everything inside . (the current directory) from cp862 to UTF-8.

Alternatively, you can use iconv and find to move everything to its correct name, walking all the files underneath the current directory and converting each name into UTF-8.
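A minimal sketch of that find-plus-iconv approach, assuming an iconv that knows cp862 (the demo file below, whose name is the single cp862 byte for Hebrew alef, is only an example); try it on a copy of your files first:

```shell
dir=$(mktemp -d)
touch "$dir/$(printf '\200')"      # demo: cp862 byte 0x80 is Hebrew alef

# Rename each entry, deepest first, converting its name from cp862 to
# UTF-8; entries whose names fail to convert are left alone.
# (Names containing newlines are not handled by this sketch.)
find "$dir" -depth | while IFS= read -r f; do
  new=$(printf '%s' "$f" | iconv -f cp862 -t utf8) || continue
  [ "$f" = "$new" ] || mv -- "$f" "$new"
done
ls "$dir"
```

Processing depth-first matters: directories must be renamed only after everything inside them has been handled, or the paths recorded by find would no longer exist.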
In either case, you can experiment with different encodings and try to find one that makes sense.
After you've fixed the encoding on your end, if you want to send these files back in the other direction, you may well hit the same problem on the other end. In that case, you can reverse the conversion before zipping the files up with -UU, since it's likely to be very hard to fix on the Windows side.