How to delete all duplicate hardlinks to a file

Tags: duplicate-files, hard-link

I've got a directory tree created by rsnapshot, which contains multiple snapshots of the same directory structure with all identical files replaced by hardlinks.

I would like to delete all those hardlink duplicates and keep only a single copy of every file (so I can later move all files into a sorted archive without having to touch identical files twice).

Is there a tool that does that?
So far I've only found tools that find duplicates and create hardlinks to replace them…
I guess I could list all files and their inode numbers and implement the deduplicating and deleting myself, but I don't want to reinvent the wheel here.

Best Answer

In the end it wasn't too hard to do this manually, based on Stéphane's and xenoid's hints and some prior experience with find.
I had to adapt a few commands to work with FreeBSD's non-GNU tools — GNU find has the -printf option that could have replaced the -exec stat, but FreeBSD's find doesn't have that.

# create a list of "<inode number> <tab> <full file path>"
find rsnapshots -type f -links +1 -exec stat -f '%i%t%R' {} + > inodes.txt
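
For reference, on a system with GNU find the same list could have been produced without stat; this is the -printf alternative mentioned above (untested here, since FreeBSD's find lacks that option):

# GNU find only: print "<inode number> <tab> <full file path>" directly
find rsnapshots -type f -links +1 -printf '%i\t%p\n' > inodes.txt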

# sort the list by inode number (to have consecutive blocks of duplicate files)
sort -n inodes.txt > inodes.sorted.txt
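
Optionally, a quick look at how many paths share each inode (purely an inspection step, not required for the rest):

# count paths per inode, biggest groups first
cut -f1 inodes.sorted.txt | uniq -c | sort -rn | head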

# drop the first entry of each inode block from the list (that one link is kept; all later entries are the duplicates to delete)
awk -F'\t' 'BEGIN {lastinode = 0} {inode = 0+$1; if (inode == lastinode) {print $2}; lastinode = inode}' inodes.sorted.txt > inodes.to-delete.txt
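
An equivalent awk one-liner that remembers seen inodes in an array instead of relying on adjacent lines (a sketch; with this variant the sort step isn't strictly needed):

# print the path of every occurrence after the first per inode
awk -F'\t' 'seen[$1]++ {print $2}' inodes.sorted.txt > inodes.to-delete.txt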

# delete the duplicates (IFS= and read -r preserve leading whitespace and backslashes; filenames containing newlines would still need special handling)
while IFS= read -r line; do rm -f "$line"; done < inodes.to-delete.txt
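
Afterwards, assuming every link to each file lived inside the rsnapshots tree, a quick check should come back empty:

# files with more than one remaining hard link (ideally none are left)
find rsnapshots -type f -links +1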