Linux – Why doesn’t Gzip compression eliminate duplicate chunks of data

I just did a little experiment where I created a tar archive with duplicate files to see if it would be compressed, to my awe, it was not! Details follow (results indented for reading pleasure):

$ dd if=/dev/urandom bs=1M count=1 of=a
  1+0 records in
  1+0 records out
  1048576 bytes (1.0 MB) copied, 0.114354 s, 9.2 MB/s
$ cp a b
$ ln a c
$ ll
  total 3072
  -rw-r--r-- 2 guido guido 1048576 Sep 24 15:51 a
  -rw-r--r-- 1 guido guido 1048576 Sep 24 15:51 b
  -rw-r--r-- 2 guido guido 1048576 Sep 24 15:51 c
$ tar -c * -f test.tar
$ ls -l test.tar 
  -rw-r--r-- 1 guido guido 2109440 Sep 24 15:51 test.tar
$ gzip test.tar 
$ ls -l test.tar.gz 
  -rw-r--r-- 1 guido guido 2097921 Sep 24 15:51 test.tar.gz
$

First I created a 1MiB file of random data (a). Then I copied it to a file b and also harlinked it to c. When creating the tarball, tar was apparently aware of the hardlink, since the tarball was only ~2MiB and not ~3Mib.

Now I expected gzip to reduce the size of the tarball to ~1MiB since a and b are duplicates, and there should be 1MiB of continuous data repeated inside the tarball, yet this didn't occur.

Why is this? And how could I compress the tarball efficiently in these cases?

Best Answer

Gzip gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. It's a lossless data compression algorithm that works by transforming the input stream into compressed symbols using a dictionary built on-the-fly and watching for duplicates. But it can't find duplicates separated by more than 32K. Expecting it to spot duplicates 1MB apart isn't realistic.

Best Answer

Related Solutions

The maximum compression ratio of gzip

Linux – How to know the gzip compression level

Related Question