I've got a few old full backups of things like binary database dumps. Obviously, they don't differ much, so taking full backups is not the smartest approach here. For now, I'm looking for a compression program that can take advantage of the fact that most of the files have very similar content.
Best compression of similar files
compression
Related Solutions
You cannot improve the compression ratio without decompressing the data. You don't have to extract all of the zip files before compressing them, but I would recommend uncompressing one whole zip file before re-compressing it.
It is possible to recompress the files in a zip file one at a time, re-adding each one before moving on to the next file contained in the archive, but this requires N rewrites of the zip file for a zip file containing N files. It is much more efficient to extract the N files and generate the new zip file in one go, compressing all files with -9.
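A rough sketch of the extract-and-rezip approach; the archive names old.zip and new.zip and the work directory are placeholders, not anything from the original question:
$ mkdir work && cd work      # "work", "old.zip" and "new.zip" are placeholder names
$ unzip ../old.zip
$ zip -9 -r ../new.zip .
$ cd .. && rm -r work
Note that zip -9 only selects the maximum deflate level; the gain over the default level is usually modest.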
The generic way to do this is to use something like pv to monitor both the input and output size of the compression program. For example:
$ pv -cpterba -N in /dev/urandom | gzip | pv -cpterba -N out > /dev/null
out: 956MiB 0:00:42 [23.1MiB/s] [22.8MiB/s] [ <=> ]
in: 956MiB 0:00:42 [23.1MiB/s] [22.8MiB/s] [ <=> ]
It's easy enough to see above that the output size is the same as the input size—as expected when attempting to compress random data.
If instead we try on a file that compresses really well:
$ pv -cpterba -N in /dev/zero | gzip | pv -cpterba -N out > /dev/null
out: 2.62MiB 0:00:25 [ 109KiB/s] [ 107KiB/s] [ <=> ]
in: 2.65GiB 0:00:25 [ 110MiB/s] [ 108MiB/s] [ <=> ]
The output size is 2.62MiB, the input is 2.65GiB—3 orders of magnitude larger.
As a side benefit, if used on a normal file, pv will give you an ETA:
$ pv -cpterba -N in debian-8.2.0-amd64-DVD-1.iso | gzip | pv -cpterba -N out > /dev/null
out: 578MiB 0:00:27 [22.1MiB/s] [21.4MiB/s] [ <=> ]
in: 595MiB 0:00:27 [22.1MiB/s] [ 22MiB/s] [==> ] 15% ETA 0:02:25
The Jessie DVD image consists mostly of already-compressed files, so it doesn't compress very well; the ETA shows it'd take another two and a half minutes to complete.
You can also use pv -d to monitor an already-running process—if you apply that to a running compressor, it will tell you where it is on the input vs. the output file, again letting you quickly see the ratio:
$ pv -pterba -d "$(pidof gzip)"
3:/var/tmp/mp3s.tar: 911MiB 0:00:44 [ 20MiB/s] [19.9MiB/s] [> ] 9% ETA 0:07:35
4:/var/tmp/mp3s.tar.gz: 906MiB 0:00:44 [ 20MiB/s] [19.8MiB/s] [ <=> ]
Tar files of MP3s do not compress well, either.
Note: Many compressors work on a block-by-block basis. That's why you may see the transfer rate spike and then drop to 0, repeatedly. You need to let the compressor run for a bit before you can get any real idea of the expected ratio. Keep in mind that right after a spike, it has probably read in a block but not yet written the compressed version; if you've already waited through 10 blocks, that's at most a 10% error.
(The pv options I'm using: -p to turn on the progress bar; -t to turn on the elapsed time; -e to turn on the ETA; -r to show the transfer rate; -b to turn on the byte counter; -c to make multiple pvs in a pipe work; -N to set the labels.)
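The same options work when you actually keep the compressed output rather than discarding it. A minimal sketch, assuming a hypothetical input file backup.tar and xz as the compressor:
$ pv -cpterba -N in backup.tar | xz -9 | pv -cpterba -N out > backup.tar.xz   # backup.tar is a placeholder name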
Best Answer
If you first tar the files (using tar cvf my_backup.tar <file list...>), then any compression tool will do a good job, as it will see the data as one big file. So just tar the files, and then compress the tar with zip, 7-zip, bzip2, etc. From the tar file, you can try the different compression algorithms and see which one performs best.
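A minimal sketch of that workflow, assuming the old dumps live under old_dumps/ (the directory and archive names are just placeholders):
$ tar cvf my_backup.tar old_dumps/        # old_dumps/ is a placeholder path
$ gzip -9 -c my_backup.tar > my_backup.tar.gz
$ bzip2 -9 -c my_backup.tar > my_backup.tar.bz2
$ xz -9 -c my_backup.tar > my_backup.tar.xz
$ ls -l my_backup.tar*                    # compare the resulting sizes
For a set of nearly identical files, compressors with a large dictionary, such as xz, tend to benefit most from this, since repeated content in one dump can match against an earlier one inside the same tar stream.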