Tell gzip/bzip2/7z/etc not to compress already-compressed files

command-line, compression, tar

I'm tarring up /home, and piping it through bzip2. However, I've got lots of already-compressed files out there (.jpg, .mp4, .mkv, .webm, etc) which bzip2 shouldn't try to compress.

Are there any CLI compressors out there that are smart enough (either via libmagic or the user enumerating extensions) not to try to back up un- or minimally-compressible files?

A similar question was asked a few years ago, but I don't know whether there have been any updates since then:
Can I command 7z to skip compression (but not inclusion) of specific files while compressing a directory with its subs?

Best Answer

The way you are doing this, compressing the .tar file as a whole, the answer is definitely no.

Whatever you use to compress the .tar file knows nothing about the contents of that file; it just sees a binary stream, and it has no way of knowing whether parts of that stream are incompressible or only minimally compressible. Don't be confused by tar's options to do the compression itself: tar --create --xz --file some.tar.xz file1 is just as "dumb" about the stream contents as tar --create file1 | xz > some.tar.xz is.
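To see why this matters, here is a small sketch using Python's stdlib (the behaviour is the same as piping tar through bzip2): high-entropy bytes, standing in for already-compressed files, simply cannot be shrunk, and the compressor burns CPU discovering that.

```python
import bz2
import os

# High-entropy bytes stand in for already-compressed files (.jpg, .mkv, ...):
raw = os.urandom(1 << 20)  # 1 MiB of incompressible data
packed = bz2.compress(raw)

# bzip2 cannot shrink it; the output even ends up slightly larger than the input.
print(len(packed) >= len(raw))  # True
```

The same applies to gzip and xz: the stream is opaque, so the effort spent on already-compressed regions is wasted.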

You can do multiple things:

  1. You switch to a container format other than .tar that compresses files on an individual basis. This is unfavourable if you have lots of small files with similar patterns in one directory, as they get compressed individually rather than sharing a dictionary. The zip format is an example that would work.
  2. You compress the files, where appropriate, before putting them in the tar file. This can be done transparently with e.g. the Python tarfile and bz2 modules. This also has the disadvantage of point 1, and there is no straightforward extraction from the tar file, as some files will come out compressed (needing decompression) while others will not (because they were already compressed before backup).
  3. Use tar as-is and live with the fact that this happens; select a not-so-high compression level for gzip/bzip2/xz so it doesn't try too hard on the stream, and thereby doesn't waste time chasing another 0.5% of compression that is not going to happen.
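For option 1, the zip format already supports a per-member choice of compression method. A sketch with Python's zipfile module (the file names and extension list are made up for illustration): store already-compressed extensions, deflate everything else.

```python
import os
import zipfile

# Extensions we assume are already compressed (illustrative list):
STORED = {".jpg", ".mp4", ".mkv", ".webm"}

def method_for(name):
    """Pick a zip method per file: don't re-compress media files."""
    ext = os.path.splitext(name)[1].lower()
    return zipfile.ZIP_STORED if ext in STORED else zipfile.ZIP_DEFLATED

# Hypothetical demo files:
os.makedirs("demo", exist_ok=True)
with open("demo/notes.txt", "w") as f:
    f.write("plain text compresses well\n" * 100)
with open("demo/photo.jpg", "wb") as f:
    f.write(os.urandom(4096))  # stand-in for real JPEG data

with zipfile.ZipFile("backup.zip", "w") as zf:
    for name in ("demo/notes.txt", "demo/photo.jpg"):
        zf.write(name, compress_type=method_for(name))

with zipfile.ZipFile("backup.zip") as zf:
    for info in zf.infolist():
        print(info.filename, info.compress_type)
```

The Info-ZIP command-line zip tool can do the equivalent by suffix, so you don't have to script it yourself.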
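Option 2 can be sketched with the tarfile and bz2 modules mentioned above (the helper name and the extension list are my own): each member is bz2-compressed before being added to an otherwise uncompressed tar, unless its extension says it is already compressed.

```python
import bz2
import io
import os
import tarfile

# Illustrative list of extensions to leave alone:
ALREADY_COMPRESSED = {".jpg", ".mp4", ".mkv", ".webm", ".gz", ".bz2", ".xz"}

def add_member(tar, path):
    """Add `path` to an *uncompressed* tar, bz2-compressing the member
    first unless its extension marks it as already compressed."""
    ext = os.path.splitext(path)[1].lower()
    with open(path, "rb") as fh:
        data = fh.read()
    arcname = path
    if ext not in ALREADY_COMPRESSED:
        data = bz2.compress(data)
        arcname += ".bz2"
    info = tarfile.TarInfo(arcname)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Hypothetical demo files:
with open("notes.txt", "w") as f:
    f.write("text that compresses\n" * 200)
with open("clip.mkv", "wb") as f:
    f.write(os.urandom(2048))  # stand-in for real video data

with tarfile.open("backup.tar", "w") as tar:  # note: "w", not "w:bz2"
    add_member(tar, "notes.txt")
    add_member(tar, "clip.mkv")

with tarfile.open("backup.tar") as tar:
    print(tar.getnames())  # ['notes.txt.bz2', 'clip.mkv']
```

This makes the extraction caveat concrete: notes.txt comes out of the archive as notes.txt.bz2 and needs a second decompression pass, while clip.mkv does not.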
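The trade-off in option 3 can be observed with the zlib module (the 8 MiB high-entropy stream is a stand-in for an archive full of already-compressed files): a low level and a high level produce essentially the same size on such a stream, so the extra effort of the high level buys nothing.

```python
import os
import time
import zlib

# Stand-in stream: mostly already-compressed (high-entropy) data.
stream = os.urandom(8 << 20)  # 8 MiB

for level in (1, 9):
    start = time.perf_counter()
    out = zlib.compress(stream, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(out)} bytes, {elapsed:.3f}s")
```

Both levels leave the stream marginally larger than the input; on real mixed data you would see the time gap widen while the size gap on the incompressible parts stays negligible.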

You might want to look at the results of parallelising xz compression (not specific to tar files), to see some results of trying to speed up xz, as published on my blog.
