Compression program showing live compression ratio

compression performance

Some compression programs can show information (like compression ratio or time and size totals) while performing the task, like xz -v:

--- %   2,580.2 KiB / 6,552.0 KiB = 0.394   1.2 MiB/s       0:05

When compressing a big file, I would like to see the compression ratio mid-task, so that I can stop the process and leave the file uncompressed if it is compressing poorly.

Are there any other programs with this feature? (xz compresses well but is slow.)

Best Answer

The generic way to do this is to use something like pv to monitor both the input and output size of the compression program. For example:

$ pv -cpterba -N in /dev/urandom | gzip | pv -cpterba -N out > /dev/null 
      out:  956MiB 0:00:42 [23.1MiB/s] [22.8MiB/s] [                           <=>        ]
       in:  956MiB 0:00:42 [23.1MiB/s] [22.8MiB/s] [                           <=>        ]

It's easy to see above that the output size matches the input size, as expected when attempting to compress random data.

If instead we try on a file that compresses really well:

$ pv -cpterba -N in /dev/zero | gzip | pv -cpterba -N out > /dev/null 
      out: 2.62MiB 0:00:25 [ 109KiB/s] [ 107KiB/s] [                   <=>                ]
       in: 2.65GiB 0:00:25 [ 110MiB/s] [ 108MiB/s] [                   <=>                ]

The output is 2.62MiB and the input is 2.65GiB, three orders of magnitude larger.
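To turn the two pv byte counters into an actual ratio, the arithmetic can be scripted. The following is only a minimal sketch: the helper names to_bytes and pv_ratio are made up for illustration and are not part of pv, and it assumes sizes written the way pv prints them above (a number immediately followed by B, KiB, MiB, GiB or TiB).

```shell
# to_bytes SIZE: convert a pv-style size such as 2.62MiB to a byte count.
# (Hypothetical helper, not a pv feature.)
to_bytes() {
    awk -v s="$1" 'BEGIN {
        n = s + 0                        # leading numeric part, e.g. 2.62
        u = s; sub(/^[0-9.]+/, "", u)    # remaining unit suffix, e.g. MiB
        m["B"] = 1; m["KiB"] = 1024; m["MiB"] = 1024^2
        m["GiB"] = 1024^3; m["TiB"] = 1024^4
        if (!(u in m)) exit 1            # unknown unit
        printf "%.0f\n", n * m[u]
    }'
}

# pv_ratio OUT_SIZE IN_SIZE: print output/input as a fraction.
pv_ratio() {
    awk -v o="$(to_bytes "$1")" -v i="$(to_bytes "$2")" 'BEGIN {
        if (i <= 0) exit 1               # avoid dividing by zero
        printf "%.4f\n", o / i
    }'
}
```

For the /dev/zero run above, pv_ratio 2.62MiB 2.65GiB comes out at roughly 0.001, i.e. the output is about a thousandth of the input.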

As a side benefit, if used on a normal file, pv will give you an ETA:

$ pv -cpterba -N in debian-8.2.0-amd64-DVD-1.iso | gzip | pv -cpterba -N out > /dev/null 
      out:  578MiB 0:00:27 [22.1MiB/s] [21.4MiB/s] [                  <=>                 ]
       in:  595MiB 0:00:27 [22.1MiB/s] [  22MiB/s] [==>                   ] 15% ETA 0:02:25

The Jessie DVD image consists mostly of already-compressed files, so it doesn't compress well (578MiB out for 595MiB in), and pv estimates it would take another two and a half minutes to complete.

You can also use pv -d to monitor an already-running process. Applied to a running compressor, it shows the current position in both the input and the output file, again letting you quickly see the ratio:

$ pv -pterba -d "$(pidof gzip)"
   3:/var/tmp/mp3s.tar:  911MiB 0:00:44 [  20MiB/s] [19.9MiB/s] [>         ]  9% ETA 0:07:35
   4:/var/tmp/mp3s.tar.gz:  906MiB 0:00:44 [  20MiB/s] [19.8MiB/s] [                <=>   ] 

Tar files of MP3s do not compress well, either.
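On Linux, the same file positions can also be read by hand from /proc, which may help when pv is not installed. This is a rough sketch under several assumptions: a Linux /proc filesystem that provides fdinfo, and that you already know which descriptor is the input and which is the output (3 and 4 in the gzip example above, but this can differ between programs). fd_pos and live_ratio are hypothetical helper names.

```shell
# fd_pos PID FD: print the current file offset of descriptor FD in process PID.
# Reads the "pos:" line of /proc/PID/fdinfo/FD (Linux only).
fd_pos() {
    awk '/^pos:/ { print $2 }' "/proc/$1/fdinfo/$2"
}

# live_ratio PID IN_FD OUT_FD: print bytes written / bytes read so far.
live_ratio() {
    in_pos=$(fd_pos "$1" "$2") || return 1
    out_pos=$(fd_pos "$1" "$3") || return 1
    awk -v o="$out_pos" -v i="$in_pos" 'BEGIN {
        if (i <= 0) exit 1               # nothing read yet, ratio undefined
        printf "%.3f\n", o / i
    }'
}
```

For the example above this would be called as live_ratio "$(pidof gzip)" 3 4. Note that pidof prints several PIDs if more than one gzip is running, in which case a single PID has to be picked by hand.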

Note: Many compressors work block by block, which is why you may see the transfer rate spike, drop to 0, and repeat. You need to let the compressor run for a while before the numbers give a real idea of the expected ratio. Right after a spike, the compressor has probably read a block but not yet written its compressed version, so the output count lags by up to one block; once you've waited through 10 blocks, that's at most a 10% error.
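Because the early numbers are noisy, a different approach (not something pv does) is to compress only a fixed-size sample from the start of the file and decide from that. Here is a minimal sketch, assuming head -c is available and that the compressor accepts -c to write to stdout, as gzip, xz and bzip2 do. sample_ratio is a hypothetical helper name, and the start of a file is not always representative of the rest.

```shell
# sample_ratio FILE [SAMPLE_BYTES] [COMPRESSOR]
# Compress the first SAMPLE_BYTES of FILE and print compressed/original size.
sample_ratio() {
    file=$1
    sample=${2:-104857600}               # default sample size: 100 MiB
    comp=${3:-gzip}
    # Count the bytes actually sampled (the file may be shorter than the sample).
    in_bytes=$(head -c "$sample" "$file" | wc -c) || return 1
    # Count the bytes the compressor produces for the same sample.
    out_bytes=$(head -c "$sample" "$file" | "$comp" -c | wc -c) || return 1
    awk -v o="$out_bytes" -v i="$in_bytes" 'BEGIN {
        if (i <= 0) exit 1               # empty or unreadable file
        printf "%.3f\n", o / i
    }'
}
```

A result close to 1, as with the ISO and MP3 cases above, suggests leaving the file uncompressed. A different compressor can be passed as the third argument, for example sample_ratio big.tar 104857600 xz.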

(The pv options I'm using: -p to turn on the progress bar; -t to turn on the elapsed time; -e to turn on the ETA; -r to show the transfer rate; -b to turn on the byte counter; -a to show the average rate; -c to make multiple pvs in a pipe work; -N to set the labels.)
