Disk usage inside archives, like ncdu

compression, disk-usage, ncurses, software-rec

I am a fond user of the ncdu utility to figure out how space is used within a directory.

However, I have a use case where I am trying to choose which folders to back up and which to skip, and the backups will be compressed (as a .tar.xz archive, but I suppose .tar.gz would yield the same result for what I have in mind). So, intuitively, I do not care that much about files that are large but will compress well (e.g., email archives), whereas I care more about files that are relatively small but will not compress at all (e.g., JPG pictures). I want to see files and folders sorted by their compressed size, not their actual uncompressed size.

A natural solution would be to compress all files, and then have an ncdu-like tool that would operate on the archive to tell me how folders take up space in the archive.

Is there any such utility?

I am OK with GUI programs (although I would prefer text-based ones), and I am OK with methods that would only work for a different compression algorithm as I imagine they would still yield useful results (e.g., replicate the hierarchy in a filesystem with built-in compression/deduplication).

Best Answer

You can use ncdu itself!

This shows the uncompressed sizes of the files.
In the case you say you care about, namely many incompressible files, that should reflect what you need fairly well.


To make the file sizes accessible to ncdu, they need to be in a file system. So we need to mount the archive as a file system.

We use archivemount, a FUSE (user-space) filesystem implementation.

Install it (on Debian or Ubuntu):

sudo apt-get install archivemount

Create a directory, mount the archive on it, cd into it, and run ncdu:

$ mkdir bash-4.3-mount
$ archivemount bash-4.3.tar.gz bash-4.3-mount
$ cd bash-4.3-mount
$ ncdu


Now you can use ncdu as usual:

ncdu 1.10 ~ Use the arrow keys to navigate, press ? for help                     
--- /tmp/archivedutest/bash-4.3-mount/bash-4.3/lib ----------------
                        /..                                                      
    1.2MiB [##########] /readline
  343.0KiB [##        ] /sh
  316.5KiB [##        ] /intl
  104.5KiB [          ] /glob
   97.0KiB [          ] /malloc
   32.0KiB [          ] /termcap
   22.0KiB [          ] /tilde

 Total disk usage:   2.1MiB  Apparent size:   2.0MiB  Items: 251                 



Now, what you are really interested in is the compressed size of the files, not the uncompressed size: you want to see which files take up the most space in the actual archive.

Strictly speaking, that's not possible because the archive is compressed as a whole. An individual file has no "compressed size".

So the compressed size of individual files can only be approximated.
One approximation would be the size of individually compressed files.
Another would be a fraction of the compressed size assuming all files compress by the same ratio. There are certainly other ways.
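As a rough sketch of the second approximation (the uniform ratio), the helper below scales each uncompressed size by the archive-wide ratio. The function name scale_by_ratio and its two arguments are made up for illustration; it reads tab-separated "bytes, path" lines such as those printed by du.

```shell
# Hypothetical helper (not part of ncdu or archivemount).
# Reads lines of the form "<uncompressed bytes><TAB><path>" on stdin and
# prints each size scaled by compressed_total / uncompressed_total.
#   $1 = size of the compressed archive in bytes
#   $2 = total uncompressed size in bytes
scale_by_ratio() {
    awk -F '\t' -v c="$1" -v u="$2" \
        '{ printf "%d\t%s\n", $1 * c / u, $2 }'
}
```

Assuming GNU du (for the -b option) and the mount point from above, it could be used like this:

    du -ab bash-4.3-mount | scale_by_ratio "$(wc -c < bash-4.3.tar.gz)" "$(du -sb bash-4.3-mount | cut -f1)" | sort -rn

Note that scaling every file by the same factor does not change the ordering, so this estimate ranks files exactly as plain ncdu does; it only rescales the numbers.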

The first seems reasonable. Implementing it requires unpacking and recompressing the individual files anyway, so I see no reason not to do just that: unpack to the filesystem, compress each file individually, and run ncdu on the result.
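A minimal sketch of the first approximation, assuming the files are reachable under some directory (either unpacked or via the archivemount mount point) and that gzip is installed. The function name compressed_sizes is made up for illustration. It compresses each file on its own and prints the resulting size, without writing any compressed copies to disk.

```shell
# Hypothetical helper: for every regular file under directory $1, print
# "<bytes><TAB><path>", where <bytes> is the size of that file when
# compressed on its own. Assumes gzip; swap in "xz -c" to mirror .tar.xz.
compressed_sizes() {
    find "$1" -type f -exec sh -c '
        for f in "$@"; do
            # Compress to stdout and count bytes; the arithmetic expansion
            # strips any padding that wc adds on some systems.
            printf "%s\t%s\n" "$(( $(gzip -c < "$f" | wc -c) ))" "$f"
        done
    ' sh {} +
}
```

For example, to list the ten files that would take the most space when compressed individually:

    compressed_sizes bash-4.3-mount | sort -rn | head

These per-file numbers will not add up to the size of the real archive, because a .tar.gz or .tar.xz compresses the whole tar stream at once, but they should still separate files that compress well from files that do not.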
