In my experience, fdupes
can be inconsistent in the order in which it outputs files (I have had my own problems using the --delete
option). The following should be fairly robust, as it doesn't require the files to appear in a specific order (as long as each duplicate group contains two copies in different folders):
# note no trailing slash
source_dir=/home/articles
target_dir=/external/articles
fdupes "$target_dir" "$source_dir" |
while IFS= read -r file; do
case "$file" in
"$source_dir/"*)
source=${file##*/}
;;
"$target_dir/"*)
target=$file
;;
'')
if [ "$source" ] && [ "$target" ]; then
echo mv -i "$target" "$target_dir/$source"
fi
unset source target
;;
esac
done
This will just print out the mv
commands; remove the echo
when you are sure you have what you want. The -i
option for mv
will also prompt you before it overwrites anything.
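As a quick sanity check, you can feed the same loop simulated fdupes output (fdupes separates duplicate groups with blank lines, which is what the '' case keys on). The file names below are made up purely for illustration:

```shell
#!/bin/sh
# Simulated fdupes output: one duplicate group, blank-line terminated.
# In real use the loop reads from fdupes itself.
source_dir=/home/articles
target_dir=/external/articles

printf '%s\n' \
    "$target_dir/IMG_1234.pdf" \
    "$source_dir/paper-on-widgets.pdf" \
    "" |
while IFS= read -r file; do
    case "$file" in
        "$source_dir/"*) source=${file##*/} ;;   # keep only the basename
        "$target_dir/"*) target=$file ;;         # full path of the copy to rename
        '')                                      # blank line ends a group
            if [ "$source" ] && [ "$target" ]; then
                echo mv -i "$target" "$target_dir/$source"
            fi
            unset source target
            ;;
    esac
done
# prints: mv -i /external/articles/IMG_1234.pdf /external/articles/paper-on-widgets.pdf
```

This shows the intended effect: the copy under the target directory gets renamed to match the basename of its duplicate under the source directory.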
First: is there a reason you need to use symlinks rather than the usual hardlinks? I am having a hard time understanding the need for symlinks with relative paths. Here is how I would solve this problem:
I think the Debian (Ubuntu) version of fdupes can replace duplicates with
hardlinks using the -L
option, but I don't have a Debian installation to verify this.
If you do not have a version with the -L
option you can use this tiny bash script I found on commandlinefu.
Note that this syntax will only work in bash.
fdupes -r -1 path | while read line; do master=""; for file in ${line[*]}; do if [ "x${master}" == "x" ]; then master=$file; else ln -f "${master}" "${file}"; fi; done; done
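Expanded into a more readable form, the one-liner is just a master/hardlink loop over each line of fdupes -1 output (one group of duplicates per line). Note it still assumes filenames without spaces, since fdupes -1 separates names within a group with spaces. The demo at the bottom is hypothetical, standing in for a real `fdupes -r -1 path | dedup` pipeline:

```shell
#!/bin/sh
# Same logic as the one-liner: keep the first name in each group as
# the master and re-create every other name as a hardlink to it.
dedup() {
    while read -r line; do
        master=""
        for file in $line; do            # intentional word splitting
            if [ -z "$master" ]; then
                master=$file             # first name in the group
            else
                ln -f "$master" "$file"  # replace duplicate with a hardlink
            fi
        done
    done
}

# Hypothetical demo (in place of: fdupes -r -1 path | dedup):
cd "$(mktemp -d)"
echo data > a; echo data > b; echo data > c
printf 'a b c\n' | dedup
stat -c %i a b c    # all three lines now show the same inode number
```

After the loop, all three names share one inode, so the duplicate data blocks are freed.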
The above command will find all duplicate files in "path" and replace them with
hardlinks. You can verify this by running ls -ilR
and looking at the inode
number. Here is a sample with ten identical files:
$ ls -ilR
.:
total 20
3094308 -rw------- 1 username group 5 Sep 14 17:21 file
3094311 -rw------- 1 username group 5 Sep 14 17:21 file2
3094312 -rw------- 1 username group 5 Sep 14 17:21 file3
3094313 -rw------- 1 username group 5 Sep 14 17:21 file4
3094314 -rw------- 1 username group 5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:22 subdirectory
./subdirectory:
total 20
3094316 -rw------- 1 username group 5 Sep 14 17:22 file
3094332 -rw------- 1 username group 5 Sep 14 17:22 file2
3094345 -rw------- 1 username group 5 Sep 14 17:22 file3
3094346 -rw------- 1 username group 5 Sep 14 17:22 file4
3094347 -rw------- 1 username group 5 Sep 14 17:22 file5
All the files have separate inode numbers, making them separate files.
Now let's deduplicate them:
$ fdupes -r -1 . | while read line; do master=""; for file in ${line[*]}; do if [ "x${master}" == "x" ]; then master=$file; else ln -f "${master}" "${file}"; fi; done; done
$ ls -ilR
.:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:24 subdirectory
./subdirectory:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5
The files now all have the same inode number, meaning they all point to the same
physical data on disk.
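You can verify this programmatically with stat as well. This sketch (using GNU stat's -c format; the file names are made up) shows two independent files collapsing into one shared inode after ln -f, and demonstrates that a write through one name is visible through the other, because both names reference the same data blocks:

```shell
#!/bin/sh
# Demo: two independent files become one shared inode after ln -f.
cd "$(mktemp -d)"
echo hello > file
echo hello > file2

stat -c '%i %h' file file2   # two different inodes, link count 1 each

ln -f file file2             # replace file2 with a hardlink to file

stat -c '%i %h' file file2   # same inode on both lines, link count 2

echo world > file            # write through one name...
cat file2                    # ...and "world" appears through the other
```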
I hope this solves your problem or at least points you in the right direction!
Best Answer
There is a Perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl that does exactly what you want: