Linux – Proper encoding for file names in zip archives created in Windows and unpacked in Linux

character encoding, linux, special characters, windows, zip

I have problems with different charsets in Windows and Linux (CentOS).

I have files with special characters in their filenames from many different languages. The zip archive is generated under Win7 and uploaded to a Linux server. Under Windows all characters are displayed correctly, as expected. But after uploading and extracting with either PHP's ZipArchive() or Linux unzip, some special characters are replaced by strange, wrong characters.

I know that this is a known problem in the interplay between Windows and Linux, but I'm not able to solve it. I've tried to unzip my zip file with different charsets, but nothing worked for me. In Portuguese the character õ causes a lot of problems, but ç is okay.

After unzipping, aplicações.txt becomes aplicaçΣes.txt

As far as I understand, Windows uses the code page IBM860, but sometimes I read windows-1257, and I do not know which charset is used when the zip archive is made with WinRar under Win7. Is there a way to check this, or to tell WinRar to use UTF-8?

When the zip archive is uploaded to a Linux OS and unzipped by ZipArchive() (PHP) or in the Linux bash with unzip, the filenames are wrong. I think this is because Linux uses UTF-8.

On the Linux command line I tried each of the following:

unzip -O windows-1257 -d zipout/
unzip -O IBM860 -d zipout/
unzip -O IBM437 -d zipout/
unzip -O UTF-8 -d zipout/
unzip -O UTF-16 -d zipout/

Best Answer

If the Windows 7 installation used for zipping the files is set to Brazilian Portuguese, then the encoding is probably IBM-850 or Windows-1252. Try these.

I have this issue too. It also happens when transferring archives between different language versions of Windows. For example, the English version of Windows uses IBM-437 and the pt-BR version uses IBM-850.
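The IBM-850 versus IBM-437 mismatch described above can be checked against the file name from the question. The following is a minimal sketch in Python (the question itself uses unzip and PHP, so Python is only used here for illustration). The function names `garble` and `try_candidates` are made up for this example. It encodes the name as IBM-850 (`cp850` in Python) and reads the bytes back as IBM-437 (`cp437`), then reverses the damage with a few candidate code pages.

```python
def garble(name: str, written_as: str = "cp850", read_as: str = "cp437") -> str:
    """Simulate a name stored in one code page and read back in another."""
    return name.encode(written_as).decode(read_as)


def try_candidates(garbled: str, read_as: str = "cp437",
                   candidates=("cp850", "cp860", "cp1252")) -> dict:
    """Re-decode a garbled name with each candidate code page."""
    raw = garbled.encode(read_as)  # recover the original bytes
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None  # these bytes are not valid in this code page
    return results


if __name__ == "__main__":
    bad = garble("aplicações.txt")
    print(bad)  # aplicaçΣes.txt, the name reported in the question
    for enc, name in try_candidates(bad).items():
        print(enc, name)
```

In both code pages ç is byte 0x87, which is why it survives, while byte 0xE4 is õ in IBM-850 and Σ in IBM-437. That matches the symptom in the question and suggests the archive was written with IBM-850 names.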

If you use WinZip for zipping, this issue does not happen. I do not recommend using the built-in Windows zip tool for zipping or extracting, as it also causes this encoding issue in filenames.
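If re-creating the archive is not an option, the names can also be repaired while extracting. Below is a hedged sketch using Python's standard `zipfile` module, not the PHP ZipArchive or unzip tools from the question. It relies on two facts: `zipfile` decodes names as cp437 unless general-purpose flag bit 11 (0x800, "name is UTF-8") is set, and that decoding can be reversed without loss. The helper names `recover_name` and `extract_with_encoding`, and the default of `cp850`, are assumptions for illustration.

```python
import shutil
import zipfile
from pathlib import Path

UTF8_FLAG = 0x800  # general-purpose bit 11: file name is stored as UTF-8


def recover_name(name: str, flag_bits: int, source_encoding: str = "cp850") -> str:
    """Undo zipfile's default cp437 decoding and apply the assumed code page."""
    if flag_bits & UTF8_FLAG:
        return name  # already decoded correctly as UTF-8
    return name.encode("cp437").decode(source_encoding)


def extract_with_encoding(archive, dest, source_encoding: str = "cp850") -> None:
    """Extract every member, writing it under its repaired name."""
    dest = Path(dest).resolve()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            fixed = recover_name(info.filename, info.flag_bits, source_encoding)
            target = (dest / fixed).resolve()
            # Refuse names that would escape the destination directory.
            if dest != target and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {fixed!r}")
            if info.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    # "upload.zip" and "zipout" are placeholder paths.
    extract_with_encoding("upload.zip", "zipout", source_encoding="cp850")
```

Checking `flag_bits & 0x800` for each entry also shows whether the archiver wrote UTF-8 names in the first place. If the flag is set, no conversion is needed.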