Linux – Why is Zip able to compress a single file smaller than multiple files with the same content?

linux zip

Suppose that I have 10,000 XML files. Now suppose that I want to send them to a friend. Before sending them, I would like to compress them.

Method 1: Don't compress them

Results:

Resulting Size: 62 MB
Percent of initial size: 100%

Method 2: Zip every file separately and send him 10,000 zip files

Command:

for x in *.xml ; do echo "$x" ; zip "$x.zip" "$x" ; done

Results:

Resulting Size: 13 MB
Percent of initial size: 20%

Method 3: Create a single zip containing 10,000 xml files

Command:

zip all.zip *.xml

Results:

Resulting Size: 12 MB
Percent of initial size: 19%

Method 4: Concatenate the files into a single file & zip it

Command:

cat *.xml > oneFile.txt ; zip oneFile.zip oneFile.txt

Results:

Resulting Size: 2 MB
Percent of initial size: 3%

Questions:

  • Why do I get such dramatically better results when I am just zipping a single file?
  • I was expecting to get drastically better results using method 3 than method 2, but don't. Why?
  • Is this behaviour specific to zip? If I tried using gzip would I get different results?

Additional info:

$ zip --version
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.
Currently maintained by E. Gordon.  Please send bug reports to
the authors using the web page at www.info-zip.org; see README for details.

Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip,
as of above date; see http://www.info-zip.org/ for other sites.

Compiled with gcc 4.4.4 20100525 (Red Hat 4.4.4-5) for Unix (Linux ELF) on Nov 11 2010.

Zip special compilation options:
    USE_EF_UT_TIME       (store Universal Time)
    SYMLINK_SUPPORT      (symbolic links supported)
    LARGE_FILE_SUPPORT   (can read and write large files on file system)
    ZIP64_SUPPORT        (use Zip64 to store large files in archives)
    UNICODE_SUPPORT      (store and read UTF-8 Unicode paths)
    STORE_UNIX_UIDs_GIDs (store UID/GID sizes/values using new extra field)
    UIDGID_NOT_16BIT     (old Unix 16-bit UID/GID extra field not used)
    [encryption, version 2.91 of 05 Jan 2007] (modified for Zip 3)

Edit: Metadata

One answer suggests that the difference is due to the file system metadata stored in the zip. I don't think that can be the case. To test, I did the following:

for x in $(seq 10000) ; do touch $x ; done
zip allZip $(ls -1)

The resulting zip of 10,000 empty files is 1.4 MB. Metadata therefore accounts for only a small part of the 12 MB archive from method 3, which leaves ~10 MB of unexplained space.
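
As a rough check of that estimate, the arithmetic can be written out. This is only a sketch using the rounded sizes quoted in this post, and it assumes 1 MB means 1,000,000 bytes; the variable names are made up for illustration. The empty test files also have shorter names than the real XML files, so the real per-entry overhead is probably a little higher.

```shell
#!/bin/sh
# Rounded figures from the post; 1 MB = 1,000,000 bytes is an assumption.
files=10000
empty_zip_bytes=1400000     # zip of 10,000 empty files
method3_bytes=12000000      # zip of the 10,000 real XML files (method 3)

# Average bytes of headers and directory data per archive entry.
per_entry=$((empty_zip_bytes / files))

# Part of the method 3 archive that metadata does not account for.
unexplained=$((method3_bytes - empty_zip_bytes))

echo "metadata per entry: ~${per_entry} bytes"
echo "not explained by metadata: ~${unexplained} bytes"
```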

Best Answer

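Zip treats the contents of each file separately when compressing. Each file gets its own compressed stream. The compression algorithm (typically DEFLATE) can identify repeated sections within a stream, but Zip has no support for finding redundancy between files.

That's why the archive is so much larger when the content is spread over multiple files: whatever the files have in common (for XML, likely the same declarations, tag names and structure) is compressed again from scratch in every entry instead of once. It also explains why method 3 is barely smaller than method 2: a single archive saves a little per-file overhead, but each entry is still compressed on its own. In method 4 the compressor sees one long stream and can refer back to content it has already seen. This is not specific to zip: gzip also uses DEFLATE, so gzipping each file individually behaves like method 2, while compressing a tar of all the files with gzip behaves like method 4.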
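
The effect can be reproduced on a small scale. The sketch below is not the asker's data: it generates 200 hypothetical XML files that share most of their content, and it uses gzip rather than zip (both use DEFLATE), reading from standard input so that no file name is stored in the output. It compares the summed size of one stream per file against a single stream over the concatenation.

```shell
#!/bin/sh
# Sketch: many small, similar files compressed separately vs. as one stream.
# gzip stands in for zip here; the file contents are invented for the demo.
set -eu

dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

# Generate 200 small XML files that differ only in the <id> value.
i=1
while [ "$i" -le 200 ]; do
    printf '<?xml version="1.0" encoding="UTF-8"?>\n<record><id>%d</id><status>active</status><owner>example</owner></record>\n' "$i" > "$dir/$i.xml"
    i=$((i + 1))
done

# Separately: one independent DEFLATE stream per file, sizes summed.
separate=0
for f in "$dir"/*.xml; do
    size=$(gzip -c < "$f" | wc -c)
    separate=$((separate + size))
done

# Together: one stream over the concatenation, so repeats across files are found.
together=$(( $(cat "$dir"/*.xml | gzip -c | wc -c) ))

echo "separate streams: $separate bytes"
echo "single stream:    $together bytes"
```

The single stream should come out far smaller, because the shared boilerplate is encoded once and every later copy becomes a short back-reference. With separate streams, each file starts with no history to refer back to.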
