Linux – Why is Zip able to compress single file smaller than multiple files with the same content

linux zip

Suppose that I have 10,000 XML files. Now suppose that I want to send them to a friend. Before sending them, I would like to compress them.

Method 1: Don't compress them

Results:

Resulting Size: 62 MB
Percent of initial size: 100%

Method 2: Zip every file and send him 10,000 zip files

Command:

for x in *.xml ; do echo "$x" ; zip "$x.zip" "$x" ; done

Results:

Resulting Size: 13 MB
Percent of initial size: 20%

Method 3: Create a single zip containing 10,000 xml files

Command:

zip all.zip *.xml

Results:

Resulting Size: 12 MB
Percent of initial size: 19%

Method 4: Concatenate the files into a single file & zip it

Command:

cat *.xml > oneFile.txt ; zip oneFile.zip oneFile.txt

Results:

Resulting Size: 2 MB
Percent of initial size: 3%

Questions:

Why do I get such dramatically better results when I am just zipping a single file?
I was expecting to get drastically better results using method 3 than method 2, but I don't. Why?
Is this behaviour specific to zip? If I tried using gzip would I get different results?

Additional info:

$ zip --version
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.
Currently maintained by E. Gordon.  Please send bug reports to
the authors using the web page at www.info-zip.org; see README for details.

Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip,
as of above date; see http://www.info-zip.org/ for other sites.

Compiled with gcc 4.4.4 20100525 (Red Hat 4.4.4-5) for Unix (Linux ELF) on Nov 11 2010.

Zip special compilation options:
    USE_EF_UT_TIME       (store Universal Time)
    SYMLINK_SUPPORT      (symbolic links supported)
    LARGE_FILE_SUPPORT   (can read and write large files on file system)
    ZIP64_SUPPORT        (use Zip64 to store large files in archives)
    UNICODE_SUPPORT      (store and read UTF-8 Unicode paths)
    STORE_UNIX_UIDs_GIDs (store UID/GID sizes/values using new extra field)
    UIDGID_NOT_16BIT     (old Unix 16-bit UID/GID extra field not used)
    [encryption, version 2.91 of 05 Jan 2007] (modified for Zip 3)

Edit: Metadata

One answer suggests that the difference is the system metadata that is stored in the zip. I don't think that this can be the case. To test, I did the following:

for x in $(seq 10000) ; do touch $x ; done
zip allZip $(ls -1)

The resulting zip is 1.4MB. This means that there is still ~10 MB of unexplained space.

Best Answer

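Zip compresses each file separately: every entry in the archive gets its own compressed stream. The compression algorithm (typically DEFLATE) finds repeated sections within a stream, but Zip has no mechanism for finding redundancy between files.

That's why the archive is so much larger when the content is spread over multiple files: whatever the XML files have in common is compressed again, from scratch, in every one of the 10,000 streams. It also explains why Methods 2 and 3 give almost the same size, since in both cases each file is compressed on its own. When the files are concatenated first (Method 4), the compressor sees a single stream and can encode the shared content as references to earlier occurrences.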
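
The effect can be reproduced without zip. The following is a minimal sketch that uses gzip (which also uses DEFLATE) on made-up data: the 200 generated XML files, their contents, and the variable names are all assumptions for illustration, not the files from the question. It compares the total size of one compressed stream per file with the size of a single stream over the concatenated files.

```shell
#!/bin/sh
# Sketch: per-file compression vs. one stream over the concatenated files.
# File contents and counts are invented for illustration.
set -eu
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

# Create 200 small XML files that share most of their markup.
i=1
while [ "$i" -le 200 ]; do
    printf '<?xml version="1.0"?>\n<record><id>%d</id><name>item-%d</name><status>active</status></record>\n' "$i" "$i" > "$dir/$i.xml"
    i=$((i + 1))
done

# One compressed stream per file, similar to Methods 2 and 3.
separate=0
for f in "$dir"/*.xml; do
    n=$(gzip -c < "$f" | wc -c)
    separate=$((separate + n))
done

# One compressed stream over everything, similar to Method 4.
n=$(cat "$dir"/*.xml | gzip -c | wc -c)
combined=$((n + 0))

echo "separate streams: $separate bytes"
echo "single stream:    $combined bytes"
```

The exact numbers depend on the data and the gzip version, but the single stream should come out much smaller, because the shared markup only has to be encoded once. This suggests the behaviour is not specific to zip: gzip applied to each file separately has the same limitation, while gzip applied to one concatenated (or tar-archived) stream does not.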

Related Solutions

MacOS – Compress a folder into multiple zip files

Pass -s with a split size to zip.
Alternatively, use zipsplit to split an existing zip archive.

Linux – Can we use pigz with --zip to compress multiple files in single zip-compatible format

I don't think pigz takes zip file name as an argument and we can then specify the files to zip.

It seems you're right.

Can we use pigz with --zip to compress multiple files in a single zip-compatible format?

Probably not; or not yet (new features can be added in the future, although adding this particular feature may not be the Right Thing; keep reading). I have found no way to do this. You need to put your files into a single archive first, then compress.

There's a reason for this. According to the Unix philosophy, programs should follow the "Do One Thing And Do It Well" rule. Putting one or more files (a directory is also a file) into a single file is one thing, and we call it "archiving". Reducing the size is another thing, and we call it "compressing". We have archivers (the common one is tar, the POSIX one is pax) and we have compressors: gzip, compress, bzip2, lzma, …

Some compressors and compressed file formats support storing multiple files because their authors were apparently not enlightened by the Unix philosophy. :)

But it's not only a philosophical issue; there are practical advantages:

You can use any archiver with any compressor. In particular you can pick another (e.g. better) compressor and still use the archiver you are most familiar with (probably GNU tar). Tools that work as both tend to invent their own options and rules for the common task of archiving.
If filesystems introduce new features then we will need to upgrade our archivers only.
If you invent a new compression method then you will be able to develop a new compressor without paying attention to how to traverse directory trees, what metadata to read or which character should separate pathname components.

pigz is a compressor and it seems it has no ambition to be an archiver. With --zip/-K it uses the .zip format associated with a tool that is designed to be a compressor and an archiver. pigz doesn't have to use all the features of the format, in particular the ability to store more than one file. It could be "improved" but now you know why I think this wouldn't be the Right Thing.

Still, archiving and compressing in one go is a pretty common use case. A good archiver should be able to write to its stdout. A good compressor should be able to read from its stdin. Then you can use them in a pipeline. This is the general way.
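
As a minimal sketch of that general pipeline, here is tar as the archiver and gzip as the compressor. The directory and file names (docs, a.txt, b.txt, restored) are hypothetical and exist only to make the example self-contained; any compressor that reads stdin and writes stdout could take the place of gzip.

```shell
#!/bin/sh
# Sketch: archive with tar, compress with gzip, joined by a pipe.
# All directory and file names here are hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

mkdir "$work/docs" "$work/restored"
echo "hello" > "$work/docs/a.txt"
echo "world" > "$work/docs/b.txt"

# The archiver writes to stdout (-f -); the compressor reads from stdin.
tar -cf - -C "$work" docs | gzip > "$work/docs.tar.gz"

# The reverse: the decompressor writes to stdout; the archiver reads from stdin.
gzip -dc "$work/docs.tar.gz" | tar -xf - -C "$work/restored"

restored=$(cat "$work/restored/docs/a.txt")
echo "$restored"
```

No intermediate .tar file is written in either direction; the archive only ever exists as a stream between the two tools.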

Specifically with tar, you can use a switch that makes the tool filter (pipe) the archive through a compressor: -z for gzip, --lzma for lzma, etc. A universal switch is -I; it allows you to use a custom compressor. The compressor can be pigz --zip:

tar -cv -I 'pigz --zip' -f archive.tar.zip file1 file2 file3

The same compressor can be used to unpack, provided it supports -d (pigz does):

tar -xv -I 'pigz --zip' -f archive.tar.zip

Technically this archive.tar.zip is a zip file with a tar file inside, so it's similar to your "zip inside zip". If you unzip it then you will get a tar archive named -. The above tar commands work on the fly though (no intermediate file created).

This is how you do it in Linux/Unix.
