Linux – Why doesn’t Gzip compression eliminate duplicate chunks of data

compression, gzip, linux

I just did a little experiment where I created a tar archive with duplicate files to see if the duplication would be compressed away; to my surprise, it was not! Details follow (results indented for reading pleasure):

$ dd if=/dev/urandom bs=1M count=1 of=a
  1+0 records in
  1+0 records out
  1048576 bytes (1.0 MB) copied, 0.114354 s, 9.2 MB/s
$ cp a b
$ ln a c
$ ll
  total 3072
  -rw-r--r-- 2 guido guido 1048576 Sep 24 15:51 a
  -rw-r--r-- 1 guido guido 1048576 Sep 24 15:51 b
  -rw-r--r-- 2 guido guido 1048576 Sep 24 15:51 c
$ tar -c * -f test.tar
$ ls -l test.tar 
  -rw-r--r-- 1 guido guido 2109440 Sep 24 15:51 test.tar
$ gzip test.tar 
$ ls -l test.tar.gz 
  -rw-r--r-- 1 guido guido 2097921 Sep 24 15:51 test.tar.gz
$ 

First I created a 1MiB file of random data (a). Then I copied it to a file b and also hardlinked it to c. When creating the tarball, tar was apparently aware of the hardlink, since the tarball was only ~2MiB and not ~3MiB.
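
As a quick check (not part of the original output), you can list the archive contents to confirm that tar stored c as a hard-link entry rather than a second copy of the data:

$ tar -tvf test.tar   # c should be listed as a link to a, contributing no file data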

Now I expected gzip to reduce the size of the tarball to ~1MiB, since a and b are duplicates and there should be 1MiB of contiguous data repeated inside the tarball, yet this didn't happen.

Why is this? And how could I compress the tarball efficiently in these cases?

Best Answer

gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. It's a lossless data compression algorithm that works by replacing repeated sequences in the input stream with back-references to earlier occurrences, using a dictionary built on the fly. But DEFLATE's sliding window is only 32 KiB, so it can't find duplicates separated by more than 32 KiB. Expecting it to spot duplicates 1 MiB apart isn't realistic.
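
To deduplicate blocks that far apart, you need a compressor whose match window covers the whole archive. As a rough sketch (not from the original post; exact numbers depend on the tool and version), xz's LZMA2 dictionary is 8 MiB at the default preset and 64 MiB at -9, so it can still reference the first copy of the data when it encounters the second copy a megabyte later:

$ xz -9 -k test.tar    # -k keeps test.tar; the -9 preset uses a 64 MiB dictionary
$ ls -l test.tar.xz    # should come out close to ~1MiB: the duplicate megabyte collapses to a back-reference, and random data itself doesn't compress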
