Finding duplicate files and replacing them with symlinks

Tags: duplicate files, fdupes, symlink

I'm trying to find a way to check a given directory for duplicate files (even with different names) and replace them with symlinks pointing to the first occurrence. I've tried fdupes, but it only lists the duplicates.
Here's the context: I'm customizing an icon theme to my liking, and I've found that many icons, even though they have different names and different locations inside their parent folder and are used for different purposes, are basically the same picture. Since applying the same modification twenty or thirty times is redundant when only one is really necessary, I want to keep just one image and symlink all the others to it.

As an example, if I run fdupes -r ./ inside the directory testdir, it might return the following results:

./file1.png
./file2.png
./subdir1/anotherfile.png
./subdir1/subdir2/yetanotherfile.png

Given this output, I'd like to keep just file1.png, delete all the others, and replace them with symlinks pointing to it, while keeping all the original file names. So file2.png would keep its name, but would become a link to file1.png instead of being a duplicate.

Those links should not point to an absolute path, but should be relative to the parent testdir directory; i.e. yetanotherfile.png should point to ../../file1.png, not to /home/testuser/.icons/testdir/file1.png.
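
For example, the link for yetanotherfile.png could be created by hand like this (shown only to illustrate the result I'm after):

$ cd testdir/subdir1/subdir2
$ ln -sf ../../file1.png yetanotherfile.png
$ readlink yetanotherfile.png
../../file1.png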

I'm interested in solutions that involve either a GUI or the CLI. It's not mandatory to use fdupes; I've cited it because it's a tool I know, but I'm open to solutions that use other tools as well.

I'm pretty sure that a bash script to handle all of this shouldn't be that difficult to write, but I'm not expert enough to work out how to do it myself.

Best Answer

First: is there a reason you need to use symlinks and not the usual hard links? I am having a hard time understanding the need for symlinks with relative paths. Here is how I would solve this problem:

I think the Debian (Ubuntu) version of fdupes can replace duplicates with hard links using the -L option, but I don't have a Debian installation to verify this.
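
If your build does have that option, the whole job would presumably come down to a single invocation (untested, for the same reason):

fdupes -r -L .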

If you do not have a version with the -L option, you can use this tiny bash script I found on commandlinefu.
Note that this syntax will only work in bash, and that it breaks on file names containing spaces, since fdupes -1 separates the names within a set with spaces.

fdupes -r -1 path | while read -r line; do master=""; for file in $line; do if [ -z "$master" ]; then master="$file"; else ln -f "$master" "$file"; fi; done; done

The above command will find all duplicate files in "path" and replace them with hard links. You can verify this by running ls -ilR and looking at the inode numbers. Here is a sample with ten identical files:

$ ls -ilR
.:
total 20
3094308 -rw------- 1 username group  5 Sep 14 17:21 file
3094311 -rw------- 1 username group  5 Sep 14 17:21 file2
3094312 -rw------- 1 username group  5 Sep 14 17:21 file3
3094313 -rw------- 1 username group  5 Sep 14 17:21 file4
3094314 -rw------- 1 username group  5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:22 subdirectory

./subdirectory:
total 20
3094316 -rw------- 1 username group 5 Sep 14 17:22 file
3094332 -rw------- 1 username group 5 Sep 14 17:22 file2
3094345 -rw------- 1 username group 5 Sep 14 17:22 file3
3094346 -rw------- 1 username group 5 Sep 14 17:22 file4
3094347 -rw------- 1 username group 5 Sep 14 17:22 file5

All the files have separate inode numbers, making them separate files. Now let's deduplicate them:

$ fdupes -r -1 . | while read -r line; do master=""; for file in $line; do if [ -z "$master" ]; then master="$file"; else ln -f "$master" "$file"; fi; done; done
$ ls -ilR
.:
total 20
3094308 -rw------- 10 username group  5 Sep 14 17:21 file
3094308 -rw------- 10 username group  5 Sep 14 17:21 file2
3094308 -rw------- 10 username group  5 Sep 14 17:21 file3
3094308 -rw------- 10 username group  5 Sep 14 17:21 file4
3094308 -rw------- 10 username group  5 Sep 14 17:21 file5
3094315 drwx------  1 username group 48 Sep 14 17:24 subdirectory

./subdirectory:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5

The files now all have the same inode number, meaning they all point to the same physical data on disk.
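
If you really do need relative symlinks instead, the same loop can be adapted. Here is an untested sketch: it assumes a GNU realpath with the --relative-to option (coreutils 8.23 or later), and like the one-liner above it breaks on file names containing spaces:

fdupes -r -1 . | while read -r line; do
    master=""
    for file in $line; do
        if [ -z "$master" ]; then
            # keep the first occurrence as the real file
            master="$file"
        else
            # compute the master's path relative to the duplicate's own
            # directory, then replace the duplicate with a relative symlink
            target=$(realpath --relative-to="$(dirname "$file")" "$master")
            ln -sf "$target" "$file"
        fi
    done
done

Afterwards, readlink on each former duplicate should print something like ../../file1.png rather than an absolute path.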

I hope this solves your problem or at least points you in the right direction!
