You can use ncdu itself!
This shows the uncompressed sizes of the files. In the case you say you care about, namely many uncompressible files, it should reflect what you need pretty well.
To make the file sizes accessible to ncdu, they need to be in a file system. So we need to mount the archive as a file system. We use archivemount, a FUSE user-space filesystem implementation:
Install archivemount:
sudo apt-get install archivemount
mkdir a directory, mount the archive to it, cd into it, and run ncdu:
$ mkdir bash-4.3-mount
$ archivemount bash-4.3.tar.gz bash-4.3-mount
$ cd bash-4.3-mount
$ ncdu
Now you can use ncdu just as you normally would:
ncdu 1.10 ~ Use the arrow keys to navigate, press ? for help
--- /tmp/archivedutest/bash-4.3-mount/bash-4.3/lib ----------------
/..
1.2MiB [##########] /readline
343.0KiB [## ] /sh
316.5KiB [## ] /intl
104.5KiB [ ] /glob
97.0KiB [ ] /malloc
32.0KiB [ ] /termcap
22.0KiB [ ] /tilde
Total disk usage: 2.1MiB Apparent size: 2.0MiB Items: 251
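When you are done, step back out of the directory and unmount the archive again (fusermount -u is the standard way to unmount a FUSE filesystem):
$ cd ..
$ fusermount -u bash-4.3-mount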
Now, what you are really interested in is the compressed size of the files, not the uncompressed size: you want to see which files take up the most space in the actual archive.
Strictly speaking, that's not possible because the archive is compressed as a whole. An individual file has no "compressed size".
So the compressed size of individual files can only be approximated.
One approximation would be the size of individually compressed files.
Another would be a fraction of the compressed size assuming all files compress by the same ratio. There are certainly other ways.
The first seems to be OK. To implement it, there is no way around actually unpacking and recompressing the individual files, so I see no reason not to just do that: unpack to the filesystem, compress each file individually, and use ncdu on the result.
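A minimal sketch of that approach, reusing the bash-4.3 tarball from above (bash-4.3-extracted is just a scratch directory name; gzip -r compresses every regular file in the tree individually and in place, so only run it on a throwaway copy, and keep in mind the per-file gzip ratios are only an approximation of whatever compressed the whole archive):
$ mkdir bash-4.3-extracted
$ tar -xzf bash-4.3.tar.gz -C bash-4.3-extracted
$ gzip -r bash-4.3-extracted
$ ncdu bash-4.3-extracted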
Your file is either truncated or corrupted, so xz can't get to the end of the data. tar complains because the archive stops in the middle, which is logical, since xz couldn't decompress all of the data.
Run the following commands to check where the problem is:
cat /var/www/bak/db/2017-05-20-1200_mysql.tar.xz >/dev/null
xzcat /var/www/bak/db/2017-05-20-1200_mysql.tar.xz >/dev/null
If cat complains, then the file is corrupted on the disk and the operating system detected the corruption. Check the kernel logs for more information; usually the disk needs to be replaced at this point. If only xz complains, then the OS didn't detect any corruption, but the file is nevertheless invalid (either corrupted or truncated). Either way, you aren't going to be able to recover this file. You'll need to get it back from your offline backups.
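As a quicker version of the second check, xz has a built-in integrity test mode (-t/--test) that reads through the whole file without writing any output:
$ xz -t /var/www/bak/db/2017-05-20-1200_mysql.tar.xz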
Best Answer
The way you are doing this, by compressing a .tar file, the answer is for sure no. Whatever you use for compressing the .tar file, it doesn't know about the contents of the file; it just sees a binary stream, and there is no way for it to know whether parts of that stream are uncompressible, or minimally compressible. Don't be confused by the options the tar command has for doing the compression: tar --create --xz --file some.tar file1 is just as "dumb" about the stream contents as tar --create file1 | xz > some.tar is.
You can do multiple things:
1. Use an archive format different from .tar that compresses files on an individual basis, but this is unfavourable if you have lots of small files in one directory that have similar patterns (as they get compressed individually). The zip format is an example that would work.
2. Write your own archiving program, e.g. in Python using the tarfile and bz2 modules, that compresses each file before adding it to the archive. This also has the disadvantages of point 1. And there is no straight extraction from the tar file, as some files will come out compressed that might not need decompression (since they were already compressed before backup).
3. Lower the compression level of gzip/bzip2/xz so that they will not try too hard to compress the stream, thereby not wasting time on trying to get another 0.5% compression which is not going to happen.
You might want to look at the results of parallelising xz compression (not specific to tar files), to see some results of trying to speed up xz, as published on my blog.
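As a concrete illustration of point 3, reusing the author's file1/some.tar example names: with a sufficiently recent xz (5.2 or later), a low compression preset such as -1 can also be combined with -T0, which runs one compression worker per available core:
$ tar --create file1 | xz -1 -T0 > some.tar.xz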