How to evaluate the best choice of archive format for compressing files

7-zip archiving compression gzip zip

In general, I've observed the following:

  • Linux-y files or tools use bzip2 or gzip for distributing archives
  • Windows-y files or tools use ZIP for distributing archives
  • Many people use 7-Zip for creating and distributing their own archives

Questions:

  • What are the advantages and disadvantages of these formats, all of which appear to be open formats? When/why should I choose one (say, 7-Zip) over another (say, ZIP)?
  • Why does the trend above appear to hold, even though all of these are portable formats? Are there any particular advantages to using a particular archive format on a particular platform?

Best Answer

There is a large variety of compression formats and methods available. Some don't compress at all and are designed only to store a number of files in one archive, while other, newer experimental compressors (PAQ-based) are designed to compress as aggressively as possible, regardless of how long the operation takes.
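To illustrate the difference between merely storing and actually compressing, here is a minimal sketch using Python's standard tarfile module. The sample data and the build_tar helper are made up for illustration; plain tar stands in for a store-only format and tar.xz for a compressing one.

```python
import io
import tarfile


def build_tar(files, mode):
    """Pack a {name: bytes} mapping into an in-memory tar archive; return its size."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return len(buffer.getvalue())


# Hypothetical sample data: highly repetitive text, so it compresses well.
sample_files = {f"log_{i}.txt": b"the same line again\n" * 2000 for i in range(5)}

stored_size = build_tar(sample_files, "w")          # plain tar: bundles files, no compression
compressed_size = build_tar(sample_files, "w:xz")   # tar wrapped in xz (LZMA2) compression

print("plain tar:", stored_size, "bytes")
print("tar.xz:   ", compressed_size, "bytes")
```

The plain tar comes out slightly larger than the raw data because of headers and padding, whereas the compressed variant should be far smaller on input this repetitive.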

You need to evaluate the features you require from your compression method choice, and also consider the context in which it will be used.

Different features and considerations include:

  • Compression ability - Does it shrink the file significantly enough?
  • Ease-of-use - If the file is going to another user, will the archive be easy to extract or will it require more software to be installed?
  • Password protection and/or encryption - Are these security measures required?
  • Multiple-volume support - If the target medium requires the file to be split into appropriately sized chunks (for example, 650 MB for a CD), does the format support this elegantly?
  • Repairing and recovery - If the file becomes partially corrupt, does it offer a recovery record to aid restoration of data?
  • Unicode support - Does the archiver support international file names or just standard ASCII?
  • System requirements - Modern compressors such as 7-Zip can increase compression efficiency by using a larger dictionary (a dictionary is a reference of commonly repeated data in a compressed file), but this in turn increases memory consumption at both compression and decompression time. Can the machines involved afford that?
  • Self-extraction support - Can the archive be rolled into an executable file that provides ease of use to whoever needs to use it? (Bear in mind that a self-extractor targets a single platform. Generally speaking, a Windows self-extractor will not work on Linux unless it is run through a compatibility layer such as Wine.)
  • File system attributes - Does the compressor store relevant file system metadata and permissions that may be worth preserving at point of extraction?
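For the compression-ability and system-requirements points, it is usually easier to measure than to guess. Below is a rough sketch using the compressors in Python's standard library; the measure helper and the sample data are hypothetical, and you should substitute a representative file of your own. Note that Python's lzma module writes .xz streams rather than 7z archives, so treat it only as an approximation of what 7-Zip's LZMA would achieve.

```python
import bz2
import gzip
import lzma
import time


def measure(name, compress, decompress, data):
    """Compress `data`, verify the round trip, and return size ratio and timing."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == data  # make sure nothing was lost
    return {"name": name, "ratio": len(packed) / len(data), "seconds": elapsed}


# Hypothetical sample input: 50,000 short, similar text records.
data = b"".join(b"record %d: status=OK\n" % i for i in range(50000))

candidates = [
    ("gzip", gzip.compress, gzip.decompress),
    ("bzip2", bz2.compress, bz2.decompress),
    ("xz/LZMA", lzma.compress, lzma.decompress),
]

results = [measure(name, c, d, data) for name, c, d in candidates]
for row in results:
    print(f"{row['name']:8s} ratio={row['ratio']:.3f} time={row['seconds']:.3f}s")
```

A lower ratio means better compression. The ranking depends heavily on the input, so run it on the kind of data you actually intend to archive.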

Generally speaking, ZIP is the most ubiquitous format, but it has two notable weaknesses. First, the original format cannot handle files or archives over 4 GB; the ZIP64 extension lifts that limit, but not every tool supports it. Second, its security support is generally regarded as poor: the standard password protection can be compromised with a known-plaintext attack, and stronger encryption is generally implemented as an unofficial derivative of the format by commercial ZIP software vendors.
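If you want to check how an existing ZIP file fares on these two points, something like the following sketch with Python's standard zipfile module may help. The describe_zip helper and the ZIP32_MAX constant are names made up here; the sketch only reports entry sizes and the encryption flag and does not tell you which encryption scheme was used.

```python
import io
import zipfile

# Largest value a 32-bit size field can hold; entries at or beyond it need ZIP64.
ZIP32_MAX = 0xFFFFFFFF


def describe_zip(fileobj):
    """Report size, ZIP64 need, and encryption flag for each entry in a ZIP archive."""
    report = []
    with zipfile.ZipFile(fileobj) as archive:
        for info in archive.infolist():
            report.append({
                "name": info.filename,
                "size": info.file_size,
                "needs_zip64": info.file_size >= ZIP32_MAX,
                # Bit 0 of the general-purpose flags marks an encrypted entry.
                "encrypted": bool(info.flag_bits & 0x1),
            })
    return report


# Build a small in-memory archive to demonstrate; allowZip64 permits large entries.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True) as archive:
    archive.writestr("notes.txt", "hello archive\n" * 100)

buffer.seek(0)
report = describe_zip(buffer)
for entry in report:
    print(entry)
```

As far as I know, Python's zipfile can read entries protected with the legacy password scheme but cannot create encrypted archives, which is itself a hint at how patchy ZIP encryption support is across tools.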

Apart from that, most other popular formats can be supported on every major operating system by installing additional software.

My personal choice is 7-Zip, as it offers great and flexible compression, despite having a peculiar user interface on Windows. There are decompressors for Linux and Mac OS X (although not GUI-based as standard).
