Unique contribution of folder to disk usage

Tags: disk-usage, hard-link

I have a backup containing folders for daily snapshots. To save space, identical files in different snapshots are deduplicated via hard links (generated by rsync).

When I'm running out of space, one option is to delete older snapshots. But because of the hard links, it is hard to figure out how much space I would gain by deleting a given snapshot.

One option I can think of is to run du -s on all snapshot folders together, then on all but the one I might delete; the difference would be the expected gained space. However, that's quite cumbersome and would have to be repeated each time I'm trying to find a suitable snapshot for deletion.
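For illustration, that manual comparison could be scripted roughly like this (a sketch only; the snap-a/snap-b names and the tiny fixture are made up for the demo):

```shell
# Sketch of the manual du approach: build a throwaway fixture,
# then compare du totals with and without one snapshot.
set -e
tmp=$(mktemp -d)
mkdir "$tmp/snap-a" "$tmp/snap-b"
head -c 8192 /dev/urandom > "$tmp/snap-a/shared"
ln "$tmp/snap-a/shared" "$tmp/snap-b/shared"      # deduplicated via hard link
head -c 8192 /dev/urandom > "$tmp/snap-b/unique"  # exists only in snap-b
# du counts each inode once per invocation, so files hard-linked
# between the listed directories are not double-counted.
all=$(du -cs --block-size=1 "$tmp"/snap-* | awk 'END{print $1}')
rest=$(du -cs --block-size=1 "$tmp/snap-a" | awk 'END{print $1}')
echo "deleting snap-b would free $((all - rest)) bytes"
rm -rf "$tmp"
```

The difference only reflects the unique file (and snap-b's directory blocks); the shared file is counted in both totals and therefore cancels out.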

Is there an easier way?


After trying out and thinking about the answers by Stéphane Chazelas and derobert, I realized that my question was not precise enough. Here's an attempt to be more precise:

I have a set of directories ("snapshots") containing files that are partially storage-identical with (hard-linked to) files in other snapshots. I'm looking for a solution that gives me a list of the snapshots and, for each one, the amount of disk storage taken up by its files, excluding any storage that is also used by a file in another snapshot. The solution should allow for hard links within each snapshot as well.

The idea is that I can look at that list to decide which of the snapshots I should delete when I run out of space, which is a trade-off between storage space gained by deletion and value of the snapshot (e.g. based on age).

Best Answer

You could do it by hand with GNU find:

find snapshot-dir -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
   awk '$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
   END{print t*512}'

That counts the disk usage of files whose link count would drop to 0 once every link found in the snapshot directory has been removed.

find prints:

  • 1 <disk-usage> for directories
  • <link-count> <disk-usage> <inode-number> for other types of files.

We pretend the link count is always one for directories because, when in practice it's not, that's because of the .. entries of subdirectories; find doesn't list those entries, and directories generally don't have other hard links.

From that output, awk adds up the disk usage of the entries that have a link count of 1, and also of the inodes it has seen <link-count> times (that is, the ones all of whose hard links are inside the snapshot directory, and which, like the ones with a link count of one, would have their space reclaimed once the directory tree is deleted).
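As a sanity check, here is the same pipeline run against a small throwaway fixture (the snap-1/snap-2 names are invented for the demo): a file whose only link lives in the snapshot is counted, while a file that is also hard-linked from another snapshot is not.

```shell
set -e
tmp=$(mktemp -d)
mkdir "$tmp/snap-1" "$tmp/snap-2"
head -c 8192 /dev/urandom > "$tmp/snap-1/only-here"   # link count 1
head -c 8192 /dev/urandom > "$tmp/snap-1/shared"
ln "$tmp/snap-1/shared" "$tmp/snap-2/shared"          # link count 2
# Deleting snap-1 would free its directory blocks plus only-here;
# shared survives through its link in snap-2 and is not counted.
freed=$(find "$tmp/snap-1" -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
  awk '$1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]}
       END{print t*512}')
echo "$freed bytes reclaimed by deleting snap-1"
rm -rf "$tmp"
```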

You can also use find snapshot-dir1 snapshot-dir2 to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are files that are found in both and only in those snapshots).

If you want to find out how much space you would save after each snapshot-dir deletion (in a cumulated fashion), you could do:

find snapshot-dir* \( -path '*/*' -o -printf "%p:\n" \) \
  -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
   awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
        $1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
        END{print t*512}'

That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).
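On the same kind of throwaway fixture (hypothetical snap-1/snap-2 names), the cumulative variant prints one running total per snapshot; a shared file only contributes once its last remaining link has been seen:

```shell
set -e
tmp=$(mktemp -d)
mkdir "$tmp/snap-1" "$tmp/snap-2"
head -c 8192 /dev/urandom > "$tmp/snap-1/only-1"
head -c 8192 /dev/urandom > "$tmp/snap-1/shared"
ln "$tmp/snap-1/shared" "$tmp/snap-2/shared"
# cd first so the top-level names contain no slash and each
# snapshot gets its "name:" header line from -printf "%p:\n".
out=$(cd "$tmp" && find snap-* \( -path '*/*' -o -printf "%p:\n" \) \
    -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
  awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
       $1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]}
       END{print t*512}')
echo "$out"
rm -rf "$tmp"
```

The second (final) number includes the shared file, so it is larger than the first by more than just snap-2's directory blocks.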

See numfmt to make the numbers more readable.
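For example, numfmt from GNU coreutils converts a raw byte count (here a made-up value) to an IEC-prefixed size:

```shell
# Convert a raw byte count to a human-readable IEC size
numfmt --to=iec 12288    # prints 12K
```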

That assumes all files are on the same filesystem. If not, you can replace %i with %D:%i (if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).