In my experience, fdupes
can be inconsistent in the order in which it outputs files (I have had my own problems using the --delete
option). The following should be fairly robust, as it doesn't require the files to appear in a specific order (as long as each duplicate group contains two copies in different folders):
# note no trailing slash
source_dir=/home/articles
target_dir=/external/articles
fdupes "$target_dir" "$source_dir" |
while IFS= read -r file; do
case "$file" in
"$source_dir/"*)
source=${file##*/}
;;
"$target_dir/"*)
target=$file
;;
'')
if [ "$source" ] && [ "$target" ]; then
echo mv -i "$target" "$target_dir/$source"
fi
unset source target
;;
esac
done
This will just print out the mv
commands; remove the echo
when you are sure you have what you want. The -i
option for mv
will also prompt you before it overwrites anything.
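As a quick sanity check, you can feed the same loop simulated fdupes output (fdupes separates duplicate groups with blank lines, which is what the '' case keys on). The file names below are made up purely for illustration:

```shell
#!/bin/sh
# Simulated fdupes output: one duplicate group, blank-line terminated.
# In real use the loop reads from fdupes itself.
source_dir=/home/articles
target_dir=/external/articles

printf '%s\n' \
    "$target_dir/IMG_1234.pdf" \
    "$source_dir/paper-on-widgets.pdf" \
    "" |
while IFS= read -r file; do
    case "$file" in
        "$source_dir/"*) source=${file##*/} ;;   # keep only the basename
        "$target_dir/"*) target=$file ;;         # full path of the copy to rename
        '')                                      # blank line ends a group
            if [ "$source" ] && [ "$target" ]; then
                echo mv -i "$target" "$target_dir/$source"
            fi
            unset source target
            ;;
    esac
done
# prints: mv -i /external/articles/IMG_1234.pdf /external/articles/paper-on-widgets.pdf
```

This shows the intended effect: the copy under the target directory gets renamed to match the basename of its duplicate under the source directory.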
First: is there a reason you need to use symlinks rather than the usual hardlinks? I am having a hard time understanding the need for symlinks with relative paths. Here is how I would solve this problem:
I think the Debian (Ubuntu) version of fdupes can replace duplicates with
hardlinks using the -L
option, but I don't have a Debian installation to verify this.
If you do not have a version with the -L
option you can use this tiny bash script I found on commandlinefu.
Note that this syntax will only work in bash.
fdupes -r -1 path | while read line; do master=""; for file in ${line[*]}; do if [ "x${master}" == "x" ]; then master=$file; else ln -f "${master}" "${file}"; fi; done; done
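Expanded into a more readable form, the one-liner is just a master/hardlink loop over each line of fdupes -1 output (one group of duplicates per line). Note it still assumes filenames without spaces, since fdupes -1 separates names within a group with spaces. The demo at the bottom is hypothetical, standing in for a real `fdupes -r -1 path | dedup` pipeline:

```shell
#!/bin/sh
# Same logic as the one-liner: keep the first name in each group as
# the master and re-create every other name as a hardlink to it.
dedup() {
    while read -r line; do
        master=""
        for file in $line; do            # intentional word splitting
            if [ -z "$master" ]; then
                master=$file             # first name in the group
            else
                ln -f "$master" "$file"  # replace duplicate with a hardlink
            fi
        done
    done
}

# Hypothetical demo (in place of: fdupes -r -1 path | dedup):
cd "$(mktemp -d)"
echo data > a; echo data > b; echo data > c
printf 'a b c\n' | dedup
stat -c %i a b c    # all three lines now show the same inode number
```

After the loop, all three names share one inode, so the duplicate data blocks are freed.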
The above command will find all duplicate files in "path" and replace them with
hardlinks. You can verify this by running ls -ilR
and looking at the inode
number. Here is a sample with ten identical files:
$ ls -ilR
.:
total 20
3094308 -rw------- 1 username group 5 Sep 14 17:21 file
3094311 -rw------- 1 username group 5 Sep 14 17:21 file2
3094312 -rw------- 1 username group 5 Sep 14 17:21 file3
3094313 -rw------- 1 username group 5 Sep 14 17:21 file4
3094314 -rw------- 1 username group 5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:22 subdirectory
./subdirectory:
total 20
3094316 -rw------- 1 username group 5 Sep 14 17:22 file
3094332 -rw------- 1 username group 5 Sep 14 17:22 file2
3094345 -rw------- 1 username group 5 Sep 14 17:22 file3
3094346 -rw------- 1 username group 5 Sep 14 17:22 file4
3094347 -rw------- 1 username group 5 Sep 14 17:22 file5
All the files have separate inode numbers, making them separate files.
Now let's deduplicate them:
$ fdupes -r -1 . | while read line; do master=""; for file in ${line[*]}; do if [ "x${master}" == "x" ]; then master=$file; else ln -f "${master}" "${file}"; fi; done; done
$ ls -ilR
.:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:24 subdirectory
./subdirectory:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5
The files now all have the same inode number, meaning they all point to the same
physical data on disk.
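You can verify this programmatically with stat as well. This sketch (using GNU stat's -c format; the file names are made up) shows two independent files collapsing into one shared inode after ln -f, and demonstrates that a write through one name is visible through the other, because both names reference the same data blocks:

```shell
#!/bin/sh
# Demo: two independent files become one shared inode after ln -f.
cd "$(mktemp -d)"
echo hello > file
echo hello > file2

stat -c '%i %h' file file2   # two different inodes, link count 1 each

ln -f file file2             # replace file2 with a hardlink to file

stat -c '%i %h' file file2   # same inode on both lines, link count 2

echo world > file            # write through one name...
cat file2                    # ...and "world" appears through the other
```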
I hope this solves your problem or at least points you in the right direction!
Best Answer
There is a Perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl that does exactly what you want: