Tell gzip/bzip2/7z/etc not to compress already-compressed files

command-line, compression, tar

I'm tarring up /home, and piping it through bzip2. However, I've got lots of already-compressed files out there (.jpg, .mp4, .mkv, .webm, etc) which bzip2 shouldn't try to compress.

Are there any CLI compressors out there that are smart enough (either via libmagic or the user enumerating extensions) not to try to back up un- or minimally-compressible files?

A similar question was asked a few years ago, but I don't know whether there have been any updates since then:
Can I command 7z to skip compression (but not inclusion) of specific files while compressing a directory with its subs?

Best Answer

The way you are doing this, compressing the .tar file as a whole, the answer is definitely no.

Whatever you use to compress the .tar file knows nothing about the contents of that file; it just sees a binary stream, and it has no way of knowing whether parts of that stream are incompressible or only minimally compressible. Don't be confused by tar's options to do the compression itself: tar --create --xz --file some.tar.xz file1 is just as "dumb" about the stream contents as tar --create file1 | xz > some.tar.xz is.
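To see why this matters, here is a small sketch using Python's stdlib (the behaviour is the same as piping tar through bzip2): high-entropy bytes, standing in for already-compressed files, simply cannot be shrunk, and the compressor burns CPU discovering that.

```python
import bz2
import os

# High-entropy bytes stand in for already-compressed files (.jpg, .mkv, ...):
raw = os.urandom(1 << 20)  # 1 MiB of incompressible data
packed = bz2.compress(raw)

# bzip2 cannot shrink it; the output even ends up slightly larger than the input.
print(len(packed) >= len(raw))  # True
```

The same applies to gzip and xz: the stream is opaque, so the effort spent on already-compressed regions is wasted.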

You can do multiple things:

  1. You switch to a container format other than .tar that compresses files on an individual basis. This is unfavourable if you have lots of small files with similar patterns in one directory, as they get compressed individually rather than sharing a dictionary. The zip format is an example that would work.
  2. You compress the files, where appropriate, before putting them in the tar file. This can be done transparently with e.g. the Python tarfile and bz2 modules. This also has the disadvantage of point 1, and there is no straightforward extraction from the tar file, as some files will come out compressed (needing decompression) while others will not (because they were already compressed before backup).
  3. Use tar as-is and live with the fact that this happens; select a not-so-high compression level for gzip/bzip2/xz so it doesn't try too hard on the stream, and thereby doesn't waste time chasing another 0.5% of compression that is not going to happen.
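For option 1, the zip format already supports a per-member choice of compression method. A sketch with Python's zipfile module (the file names and extension list are made up for illustration): store already-compressed extensions, deflate everything else.

```python
import os
import zipfile

# Extensions we assume are already compressed (illustrative list):
STORED = {".jpg", ".mp4", ".mkv", ".webm"}

def method_for(name):
    """Pick a zip method per file: don't re-compress media files."""
    ext = os.path.splitext(name)[1].lower()
    return zipfile.ZIP_STORED if ext in STORED else zipfile.ZIP_DEFLATED

# Hypothetical demo files:
os.makedirs("demo", exist_ok=True)
with open("demo/notes.txt", "w") as f:
    f.write("plain text compresses well\n" * 100)
with open("demo/photo.jpg", "wb") as f:
    f.write(os.urandom(4096))  # stand-in for real JPEG data

with zipfile.ZipFile("backup.zip", "w") as zf:
    for name in ("demo/notes.txt", "demo/photo.jpg"):
        zf.write(name, compress_type=method_for(name))

with zipfile.ZipFile("backup.zip") as zf:
    for info in zf.infolist():
        print(info.filename, info.compress_type)
```

The Info-ZIP command-line zip tool can do the equivalent by suffix, so you don't have to script it yourself.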
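Option 2 can be sketched with the tarfile and bz2 modules mentioned above (the helper name and the extension list are my own): each member is bz2-compressed before being added to an otherwise uncompressed tar, unless its extension says it is already compressed.

```python
import bz2
import io
import os
import tarfile

# Illustrative list of extensions to leave alone:
ALREADY_COMPRESSED = {".jpg", ".mp4", ".mkv", ".webm", ".gz", ".bz2", ".xz"}

def add_member(tar, path):
    """Add `path` to an *uncompressed* tar, bz2-compressing the member
    first unless its extension marks it as already compressed."""
    ext = os.path.splitext(path)[1].lower()
    with open(path, "rb") as fh:
        data = fh.read()
    arcname = path
    if ext not in ALREADY_COMPRESSED:
        data = bz2.compress(data)
        arcname += ".bz2"
    info = tarfile.TarInfo(arcname)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Hypothetical demo files:
with open("notes.txt", "w") as f:
    f.write("text that compresses\n" * 200)
with open("clip.mkv", "wb") as f:
    f.write(os.urandom(2048))  # stand-in for real video data

with tarfile.open("backup.tar", "w") as tar:  # note: "w", not "w:bz2"
    add_member(tar, "notes.txt")
    add_member(tar, "clip.mkv")

with tarfile.open("backup.tar") as tar:
    print(tar.getnames())  # ['notes.txt.bz2', 'clip.mkv']
```

This makes the extraction caveat concrete: notes.txt comes out of the archive as notes.txt.bz2 and needs a second decompression pass, while clip.mkv does not.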
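The trade-off in option 3 can be observed with the zlib module (the 8 MiB high-entropy stream is a stand-in for an archive full of already-compressed files): a low level and a high level produce essentially the same size on such a stream, so the extra effort of the high level buys nothing.

```python
import os
import time
import zlib

# Stand-in stream: mostly already-compressed (high-entropy) data.
stream = os.urandom(8 << 20)  # 8 MiB

for level in (1, 9):
    start = time.perf_counter()
    out = zlib.compress(stream, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(out)} bytes, {elapsed:.3f}s")
```

Both levels leave the stream marginally larger than the input; on real mixed data you would see the time gap widen while the size gap on the incompressible parts stays negligible.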

You might want to look at the results of parallelising xz compression (not specific to tar files), to see some results of trying to speed up xz, as published on my blog.
