Finding duplicate files and replacing them with symlinks

Tags: duplicate files, fdupes, symlink

I'm trying to find a way to check a given directory for duplicate files (even with different names) and replace them with symlinks pointing to the first occurrence. I've tried fdupes, but it only lists the duplicates.
Here's the context: I'm customizing an icon theme to my liking, and I've found that many icons, even though they have different names and different locations inside their parent folder and are used for different purposes, are basically the same picture. Since applying the same modification twenty or thirty times is redundant when only one is really necessary, I want to keep just one image and symlink all the others to it.

As an example, if I run fdupes -r ./ inside the directory testdir, it might return the following results:

./file1.png
./file2.png
./subdir1/anotherfile.png
./subdir1/subdir2/yetanotherfile.png

Given this output, I'd like to keep just file1.png, delete all the others, and replace them with symlinks pointing to it, while keeping all the original file names. So file2.png would keep its name, but would become a link to file1.png instead of being a duplicate.

Those links should not point to an absolute path, but should be relative to the parent testdir directory; i.e. yetanotherfile.png should point to ../../file1.png, not to /home/testuser/.icons/testdir/file1.png.
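
For example, the link for yetanotherfile.png could be created by hand like this (shown only to illustrate the result I'm after):

$ cd testdir/subdir1/subdir2
$ ln -sf ../../file1.png yetanotherfile.png
$ readlink yetanotherfile.png
../../file1.png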

I'm interested in solutions that involve either a GUI or the CLI. It's not mandatory to use fdupes; I've cited it because it's a tool I know, but I'm open to solutions that use other tools as well.

I'm pretty sure that a bash script to handle all of this shouldn't be that difficult to write, but I'm not expert enough to work out how to do it myself.

Best Answer

First: is there a reason you need to use symlinks and not the usual hard links? I am having a hard time understanding the need for symlinks with relative paths. Here is how I would solve this problem:

I think the Debian (Ubuntu) version of fdupes can replace duplicates with hard links using the -L option, but I don't have a Debian installation to verify this.
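
If your build does have that option, the whole job would presumably come down to a single invocation (untested, for the same reason):

fdupes -r -L .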

If you do not have a version with the -L option, you can use this tiny bash script I found on commandlinefu.
Note that this syntax will only work in bash, and that it breaks on file names containing spaces, since fdupes -1 separates the names within a set with spaces.

fdupes -r -1 path | while read -r line; do master=""; for file in $line; do if [ -z "$master" ]; then master="$file"; else ln -f "$master" "$file"; fi; done; done

The above command will find all duplicate files in "path" and replace them with hard links. You can verify this by running ls -ilR and looking at the inode numbers. Here is a sample with ten identical files:

$ ls -ilR
.:
total 20
3094308 -rw------- 1 username group  5 Sep 14 17:21 file
3094311 -rw------- 1 username group  5 Sep 14 17:21 file2
3094312 -rw------- 1 username group  5 Sep 14 17:21 file3
3094313 -rw------- 1 username group  5 Sep 14 17:21 file4
3094314 -rw------- 1 username group  5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:22 subdirectory

./subdirectory:
total 20
3094316 -rw------- 1 username group 5 Sep 14 17:22 file
3094332 -rw------- 1 username group 5 Sep 14 17:22 file2
3094345 -rw------- 1 username group 5 Sep 14 17:22 file3
3094346 -rw------- 1 username group 5 Sep 14 17:22 file4
3094347 -rw------- 1 username group 5 Sep 14 17:22 file5

All the files have separate inode numbers, making them separate files. Now let's deduplicate them:

$ fdupes -r -1 . | while read -r line; do master=""; for file in $line; do if [ -z "$master" ]; then master="$file"; else ln -f "$master" "$file"; fi; done; done
$ ls -ilR
.:
total 20
3094308 -rw------- 10 username group  5 Sep 14 17:21 file
3094308 -rw------- 10 username group  5 Sep 14 17:21 file2
3094308 -rw------- 10 username group  5 Sep 14 17:21 file3
3094308 -rw------- 10 username group  5 Sep 14 17:21 file4
3094308 -rw------- 10 username group  5 Sep 14 17:21 file5
3094315 drwx------  1 username group 48 Sep 14 17:24 subdirectory

./subdirectory:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5

The files now all have the same inode number, meaning they all point to the same physical data on disk.
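
If you really do need relative symlinks instead, the same loop can be adapted. Here is an untested sketch: it assumes a GNU realpath with the --relative-to option (coreutils 8.23 or later), and like the one-liner above it breaks on file names containing spaces:

fdupes -r -1 . | while read -r line; do
    master=""
    for file in $line; do
        if [ -z "$master" ]; then
            # keep the first occurrence as the real file
            master="$file"
        else
            # compute the master's path relative to the duplicate's own
            # directory, then replace the duplicate with a relative symlink
            target=$(realpath --relative-to="$(dirname "$file")" "$master")
            ln -sf "$target" "$file"
        fi
    done
done

Afterwards, readlink on each former duplicate should print something like ../../file1.png rather than an absolute path.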

I hope this solves your problem or at least points you in the right direction!
