Unique contribution of folder to disk usage

Tags: disk-usage, hard-link

I have a backup containing folders for daily snapshots. To save space, identical files in different snapshots are deduplicated via hard links (generated by rsync).

When I'm running out of space, one option is to delete older snapshots. But because of the hard links, it is hard to figure out how much space I would gain by deleting a given snapshot.

One option I can think of is to run du -s on all snapshot folders together, then on all but the one I might delete; the difference would be the expected gained space. However, that's quite cumbersome and would have to be repeated each time I'm trying to find a suitable snapshot for deletion.
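For illustration, that manual comparison could be scripted roughly like this (a sketch only; the snap-a/snap-b names and the tiny fixture are made up for the demo):

```shell
# Sketch of the manual du approach: build a throwaway fixture,
# then compare du totals with and without one snapshot.
set -e
tmp=$(mktemp -d)
mkdir "$tmp/snap-a" "$tmp/snap-b"
head -c 8192 /dev/urandom > "$tmp/snap-a/shared"
ln "$tmp/snap-a/shared" "$tmp/snap-b/shared"      # deduplicated via hard link
head -c 8192 /dev/urandom > "$tmp/snap-b/unique"  # exists only in snap-b
# du counts each inode once per invocation, so files hard-linked
# between the listed directories are not double-counted.
all=$(du -cs --block-size=1 "$tmp"/snap-* | awk 'END{print $1}')
rest=$(du -cs --block-size=1 "$tmp/snap-a" | awk 'END{print $1}')
echo "deleting snap-b would free $((all - rest)) bytes"
rm -rf "$tmp"
```

The difference only reflects the unique file (and snap-b's directory blocks); the shared file is counted in both totals and therefore cancels out.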

Is there an easier way?


After trying out and thinking about the answers by Stéphane Chazelas and derobert, I realized that my question was not precise enough. Here's an attempt to be more precise:

I have a set of directories ("snapshots") containing files that are partially storage-identical with (hard-linked to) files in other snapshots. I'm looking for a solution that gives me a list of the snapshots and, for each one, the amount of disk storage taken up by its files, excluding any storage that is also used by a file in another snapshot. The solution should allow for hard links within each snapshot as well.

The idea is that I can look at that list to decide which of the snapshots I should delete when I run out of space, which is a trade-off between storage space gained by deletion and value of the snapshot (e.g. based on age).

Best Answer

You could do it by hand with GNU find:

find snapshot-dir -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
   awk '$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
   END{print t*512}'

That counts the disk usage of files whose link count would drop to 0 once every link found in the snapshot directory has been removed.

find prints:

  • 1 <disk-usage> for directories
  • <link-count> <disk-usage> <inode-number> for other types of files.

We pretend the link count is always one for directories because, when in practice it's not, that's because of the .. entries of subdirectories; find doesn't list those entries, and directories generally don't have other hard links.

From that output, awk adds up the disk usage of the entries that have a link count of 1, and also of the inodes it has seen <link-count> times (that is, the ones all of whose hard links are inside the snapshot directory, and which, like the ones with a link count of one, would have their space reclaimed once the directory tree is deleted).
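As a sanity check, here is the same pipeline run against a small throwaway fixture (the snap-1/snap-2 names are invented for the demo): a file whose only link lives in the snapshot is counted, while a file that is also hard-linked from another snapshot is not.

```shell
set -e
tmp=$(mktemp -d)
mkdir "$tmp/snap-1" "$tmp/snap-2"
head -c 8192 /dev/urandom > "$tmp/snap-1/only-here"   # link count 1
head -c 8192 /dev/urandom > "$tmp/snap-1/shared"
ln "$tmp/snap-1/shared" "$tmp/snap-2/shared"          # link count 2
# Deleting snap-1 would free its directory blocks plus only-here;
# shared survives through its link in snap-2 and is not counted.
freed=$(find "$tmp/snap-1" -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
  awk '$1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]}
       END{print t*512}')
echo "$freed bytes reclaimed by deleting snap-1"
rm -rf "$tmp"
```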

You can also use find snapshot-dir1 snapshot-dir2 to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are files that are found in both and only in those snapshots).

If you want to find out how much space you would save after each snapshot-dir deletion (in a cumulated fashion), you could do:

find snapshot-dir* \( -path '*/*' -o -printf "%p:\n" \) \
  -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
   awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
        $1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
        END{print t*512}'

That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).
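On the same kind of throwaway fixture (hypothetical snap-1/snap-2 names), the cumulative variant prints one running total per snapshot; a shared file only contributes once its last remaining link has been seen:

```shell
set -e
tmp=$(mktemp -d)
mkdir "$tmp/snap-1" "$tmp/snap-2"
head -c 8192 /dev/urandom > "$tmp/snap-1/only-1"
head -c 8192 /dev/urandom > "$tmp/snap-1/shared"
ln "$tmp/snap-1/shared" "$tmp/snap-2/shared"
# cd first so the top-level names contain no slash and each
# snapshot gets its "name:" header line from -printf "%p:\n".
out=$(cd "$tmp" && find snap-* \( -path '*/*' -o -printf "%p:\n" \) \
    -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
  awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
       $1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]}
       END{print t*512}')
echo "$out"
rm -rf "$tmp"
```

The second (final) number includes the shared file, so it is larger than the first by more than just snap-2's directory blocks.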

See numfmt to make the numbers more readable.
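For example, numfmt from GNU coreutils converts a raw byte count (here a made-up value) to an IEC-prefixed size:

```shell
# Convert a raw byte count to a human-readable IEC size
numfmt --to=iec 12288    # prints 12K
```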

That assumes all files are on the same filesystem. If not, you can replace %i with %D:%i (if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).