Is it possible to find duplicate files on my disk which are bit-for-bit identical but have different file names?
Related Solutions
There is a Perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:
Traverse all directories named on the command line, compute MD5 checksums and find files with identical MD5. If they are equal, do a real comparison; if they are really equal, replace the second of two files with a hard link to the first one.
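For illustration, here is a minimal bash sketch of the same idea (my own sketch, not the Perl script itself): hash every file, confirm matches byte for byte with cmp, then hard-link the duplicate to the first copy. It assumes GNU md5sum and skips the bookkeeping the real script does:
#!/bin/bash
# Sketch only: group files by MD5, verify byte-for-byte, hard-link duplicates.
declare -A seen
while IFS= read -r -d '' file; do
    sum=$(md5sum "$file" | cut -d' ' -f1)
    if [ -z "${seen[$sum]}" ]; then
        seen[$sum]=$file                        # first file with this checksum
    elif cmp -s "${seen[$sum]}" "$file"; then   # real comparison, not just the hash
        ln -f "${seen[$sum]}" "$file"           # replace duplicate with a hard link
    fi
done < <(find "$@" -type f -print0)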
First: is there a reason you need to use symlinks rather than the usual hard links? I have a hard time understanding the need for symlinks with relative paths. Here is how I would solve this problem:
I think the Debian (Ubuntu) version of fdupes can replace duplicates with hard links using the -L option, but I don't have a Debian installation to verify this.
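If your build does support it, the invocation should be as simple as the following (untested, per the option described above):
fdupes -r -L path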
If you do not have a version with the -L option, you can use this tiny bash script I found on commandlinefu. Note that this syntax will only work in bash.
fdupes -r -1 path | while read line; do master=""; for file in ${line[*]}; do if [ "x${master}" == "x" ]; then master=$file; else ln -f "${master}" "${file}"; fi; done; done
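The same one-liner broken out with comments (identical logic; note that the word splitting on ${line[*]} means it will misbehave on file names containing spaces):
fdupes -r -1 path | while read line; do
    master=""
    for file in ${line[*]}; do                 # split the space-separated group
        if [ "x${master}" == "x" ]; then
            master=$file                       # first file becomes the link target
        else
            ln -f "${master}" "${file}"        # replace duplicate with a hard link
        fi
    done
done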
The above command will find all duplicate files in "path" and replace them with hard links. You can verify this by running ls -ilR and looking at the inode number. Here is a sample with ten identical files:
$ ls -ilR
.:
total 20
3094308 -rw------- 1 username group 5 Sep 14 17:21 file
3094311 -rw------- 1 username group 5 Sep 14 17:21 file2
3094312 -rw------- 1 username group 5 Sep 14 17:21 file3
3094313 -rw------- 1 username group 5 Sep 14 17:21 file4
3094314 -rw------- 1 username group 5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:22 subdirectory
./subdirectory:
total 20
3094316 -rw------- 1 username group 5 Sep 14 17:22 file
3094332 -rw------- 1 username group 5 Sep 14 17:22 file2
3094345 -rw------- 1 username group 5 Sep 14 17:22 file3
3094346 -rw------- 1 username group 5 Sep 14 17:22 file4
3094347 -rw------- 1 username group 5 Sep 14 17:22 file5
All the files have separate inode numbers, making them separate files. Now let's deduplicate them:
$ fdupes -r -1 . | while read line; do master=""; for file in ${line[*]}; do if [ "x${master}" == "x" ]; then master=$file; else ln -f "${master}" "${file}"; fi; done; done
$ ls -ilR
.:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:24 subdirectory
./subdirectory:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5
The files now all have the same inode number, meaning they all point to the same physical data on disk.
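If you prefer not to eyeball ls output, stat can print the inode and hard-link count directly (GNU stat shown; the format string is my choice):
$ stat -c '%i %h %n' file file2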
I hope this solves your problem or at least points you in the right direction!
Best Answer
I'd like to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature-rich than fdupes (e.g. a size filter):
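The command itself is missing from the text here; with a recent jdupes it would look roughly like the line below. The -X size+:50m extended filter is my assumption from the jdupes documentation, so check jdupes --help on your version:
jdupes -r -X size+:50m . > myjdups.txt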
This will recursively find duplicated files bigger than 50MB in the current directory and output the resulting list to myjdups.txt.
Note that the output is not sorted by size, and since sorting appears not to be built in, I have adapted @Chris_Down's answer above to achieve this:
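That adapted pipeline is likewise missing from the text; a sketch of the idea (run du on every listed file, skipping the blank lines jdupes prints between duplicate groups, then sort numerically by size) would be:
jdupes -r -X size+:50m . | {
    while IFS= read -r file; do
        [[ $file ]] && du "$file"    # skip the blank separators between groups
    done
} | sort -n > myjdups_sorted.txt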