Compress Similar Files Efficiently

compression

I frequently need to compress files that are very similar to each other.

Currently I use 7-Zip, which compresses a 16 GB file down to 1.2 GB in about 35 minutes using 8 cores with Ultra settings.

It seems to me that much of that time is spent computing the dictionary to use for compression. Since the files are highly similar, the dictionary actually used is likely also similar.

Is there a Windows-based compression tool (7-Zip with an option I'm not aware of, or a different tool) that can save the dictionary and reuse that saved dictionary for subsequent files?

Is there a better way to approach the problem of maintaining a compression ratio similar to what I have, while compressing significantly faster?

Best Answer

The Lempel-Ziv-Welch (LZW) compression algorithm is inherently computationally intensive, and the majority of that work is the computation of the dictionary itself. That is simply how LZW works. (7-Zip's default LZMA method is a dictionary coder from the same Lempel-Ziv family, so the same reasoning applies to your case.)

The algorithm adds one new dictionary entry for every "symbol" it scans, so the dictionary grows on every single iteration. In effect, the dictionary is derived from the exact data being compressed: it is inseparable from the contents of each particular file, and building it is the only thing LZW compression spends any significant time computing in the first place.
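To make this concrete, here is a minimal sketch of LZW compression in Python (assuming byte-oriented input; the function name lzw_compress is made up for this example). The key line is the one that adds a dictionary entry on every mismatch: the entries are determined by the exact byte sequence being compressed, which is why the dictionary cannot be precomputed and handed to a later run.

```python
def lzw_compress(data: bytes) -> list[int]:
    # Seed the dictionary with all single-byte sequences (codes 0-255).
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    output = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            # Keep extending the current match.
            current = candidate
        else:
            # Emit the code for the longest match seen so far...
            output.append(dictionary[current])
            # ...and add exactly one new entry. This step runs on every
            # mismatch and depends on the actual input bytes, so the
            # dictionary is built as a side effect of compressing.
            dictionary[candidate] = next_code
            next_code += 1
            current = bytes([byte])
    if current:
        output.append(dictionary[current])
    return output

print(lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT"))
```

Note that the decompressor rebuilds the identical dictionary from the code stream alone, which is also why LZW never transmits the dictionary: it is implicit in the compressed output.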


If you used something like Huffman encoding, reusing the table would indeed be possible (at the expense of a possibly sub-optimal compression ratio). However, most modern compression tools, 7-Zip included, use LZ-family dictionary algorithms for efficiency and speed: Huffman compression requires two passes over the data (one to build the Huffman tree/table, another to actually encode the data), whereas LZ-style compression completes in a single pass.
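As a rough illustration of that trade-off, here is a hedged Python sketch (the helper huffman_codes is hypothetical, written just for this answer). The first pass builds a code table that is entirely separate from the encoded output, which is what would make saving the table and reusing it on a similar file possible in principle.

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict[int, str]:
    # Pass 1: count symbol frequencies over the whole input.
    freq = Counter(data)
    # Build the Huffman tree with a min-heap. Each heap item is
    # (weight, tiebreaker, node); nodes are (symbol, left, right).
    heap = [(count, i, (sym, None, None))
            for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (None, left, right)))
        tiebreak += 1
    # Walk the tree to assign a bit string to every symbol.
    codes: dict[int, str] = {}
    def walk(node, prefix):
        sym, left, right = node
        if sym is not None:
            codes[sym] = prefix or "0"   # degenerate single-symbol input
        else:
            walk(left, prefix + "0")
            walk(right, prefix + "1")
    walk(heap[0][2], "")
    return codes

data = b"abracadabra"
table = huffman_codes(data)                 # pass 1: build (and could save) the table
encoded = "".join(table[b] for b in data)   # pass 2: encode using the table
print(table, encoded)
```

Applying a saved table to a new, similar file would skip pass 1 entirely; the cost is that the code lengths are only optimal for the file the table was built from.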