Faster way to rename duplicate files (identified by fdupes) in another directory

duplicatefdupesfilesrenamescripting

I have a directory full of pdf files of journal articles, most of which are named by their bibtex key. Some time ago I made a backup on an external hard drive, but I haven't kept it up to date and there are tons of duplicates with different names. I want to get the two directories back into sync and delete the extra files.

Using fdupes I have identified a bunch of these, and now I have a nice paired list of them. However, most of the duplicates on the external drive have meaningless names. I'd like to rename them to be the same as the duplicate in the first directory, rather than deleting them and copying them over again, because there are so many of them. So I don't want to just use rsync.

For example, if the fdupes output is:

/home/articles/bibtex.pdf
/external/articles/morearticles44.pdf

Is there a faster way than writing

mv /external/articles/morearticles44.pdf /external/articles/bibtex.pdf

for each pair of duplicates?

Best Answer

In my experience fdupes can be inconsistent in the order that it outputs files (I have had my own problems using the --delete option). This should be fairly robust as it doesn't require the files to be in a specific order (as as long as there are always two dupes in different folders):

# note no trailing slash
source_dir=/home/articles
target_dir=/external/articles

fdupes "$target_dir" "$source_dir" |
  while IFS= read file; do
    case "$file" in
      "$source_dir/"*)
         source=${file##*/}
         ;;
      "$target_dir/"*)
         target=$file
         ;;
      '')
         if [ "$source" ] && [ "$target" ]; then
           echo mv -i "$target" "$target_dir/$source"
         fi
         unset source target
         ;;
    esac
  done

This will just print out the mv commands, remove the echo when you are sure you have what you want. Also the -i option for mv will prompt you if it is going to overwrite anything.

Related Solutions

Finding duplicate files and replace them with symlinks

First; Is there a reason you need to use symlinks and not the usual hardlinks? I am having a hard time understanding the need for symlinks with relative paths. Here is how I would solve this problem:

I think the Debian (Ubuntu) version of fdupes can replace duplicates with hard links using the -L option, but I don't have a Debian installation to verify this.

If you do not have a version with the -L option you can use this tiny bash script I found on commandlinefu.
Note that this syntax will only work in bash.

fdupes -r -1 path | while read line; do master=""; for file in ${line[*]}; do if [ "x${master}" == "x" ]; then master=$file; else ln -f "${master}" "${file}"; fi; done; done

The above command will find all duplicate files in "path" and replace them with hardlinks. You can verify this by running ls -ilR and looking at the inode number. Here is a samle with ten identical files:

$ ls -ilR

total 20
3094308 -rw------- 1 username group  5 Sep 14 17:21 file
3094311 -rw------- 1 username group  5 Sep 14 17:21 file2
3094312 -rw------- 1 username group  5 Sep 14 17:21 file3
3094313 -rw------- 1 username group  5 Sep 14 17:21 file4
3094314 -rw------- 1 username group  5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:22 subdirectory

./subdirectory:
total 20
3094316 -rw------- 1 username group 5 Sep 14 17:22 file
3094332 -rw------- 1 username group 5 Sep 14 17:22 file2
3094345 -rw------- 1 username group 5 Sep 14 17:22 file3
3094346 -rw------- 1 username group 5 Sep 14 17:22 file4
3094347 -rw------- 1 username group 5 Sep 14 17:22 file5

All the files have separate inode numbers, making them separate files. Now lets deduplicate them:

$ fdupes -r -1 . | while read line; do j="0"; for file in ${line[*]}; do if [ "$j" == "0" ]; then j="1"; else ln -f ${line// .*/} $file; fi; done; done
$ ls -ilR
.:
total 20
3094308 -rw------- 10 username group  5 Sep 14 17:21 file
3094308 -rw------- 10 username group  5 Sep 14 17:21 file2
3094308 -rw------- 10 username group  5 Sep 14 17:21 file3
3094308 -rw------- 10 username group  5 Sep 14 17:21 file4
3094308 -rw------- 10 username group  5 Sep 14 17:21 file5
3094315 drwx------  1 username group 48 Sep 14 17:24 subdirectory

./subdirectory:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5

The files now all have the same inode number, meaning they all point to the same physical data on disk.

I hope this solves your problem or at least points you in the right direction!

Find duplicate PDF files by content

Since different publishers use different methods of "marking" the PDFs you need to make sure you compare without taking the markings into account.

You also need an efficient method to compare a new PDF to all already downloaded PDFs in case you repeatedly download the same PDF and it is e.g. marked with the IP and/or date-time-stamp as you suggest. You don't want to use a time consuming comparison mechanism that compares each new PDF with many already downloaded PDFs

What you need is a utility that strips each of the possible markings and generate a hash of the remaining data. You will need to keep a hash → file name map, which can be in a simple file, and if a computed hash is already in the file you have a duplicate (and delete it or do whatever needed) and if the hash in not yet there, you add the hash and file name. The file would look something like:

6fcb6969835d2db7742e81267437c432  /home/anthon/Downloads/explanation.pdf
fa24fed8ca824976673a51803934d6b9  /home/anthon/orders/your_order_20150320.pdf

That file is negligently small compared to the original PDFs. If you have millions of PDFs you might consider storing this data in a database. For efficiency sake you might want to include the filesize and number of pages in there (pdfinfo | egrep -E '^Pages:' | grep -Eo '[0-9]*').

The above pushes the problem to removing the markings and generating the hash. If you know where the PDF comes from when invoking the hash generating routine (i.e. if you do the downloads programmatically), you can fine-tune the hash generation based on that. But even without that there are several possibilities for hash generation:

if the metadata for title and author is non-empty and does not including non-specific strings like "Acrobat" or "PDF" you could generate the hash based on just the author and title information. Use pdfinfo -E file.pdf | grep -E '^(Author:)|(Title:) | md5sum to get the hash. You can include the number of pages in calculating the hash as well ('Pages:' in the pdfinfo output).
if the previous rule doesn't work and the PDF contains images, extract the images and generate a hash on the combined image data. If the images ever contain text in the footer or header like "Licensed to Joe User", strip an X number of lines form the top or bottom, before calculating the hash. If that markings is in some big lettered grayed background text this will of course not work, unless you filter out pixels that are not totally black (for that you could use imagemagick). You can use pdfimages to extract the image information into a temporary file.
if the previous rules don't work (because there are no images) you can use pdftext to extract the text, filter out the marking (if you filter out a little to much, that is not a problem) and then generate the hash based on that.

Additionally you can compare if the file size of the old file found via the hash and see if is within certain margins with the new file. Compression and ifferences in strings (IP/date-time-stamp) should only result in less than one percent difference.

If you know the method the publisher uses when determining the hash, you can directly apply the "right" method of the above, but even without that you can check for the metadata and apply some heuristics, or determine the number of images in a file and compare that with the number of pages (if they are close you probably have a document consisting of scans). pdftext on scanned image PDFs also has a recognisable output.

As a basis to work from I created a python package that is on bitbucket and/or can be installed from PyPI using pip install ruamel.pdfdouble. This provides you with the pdfdbl command that does the scanning as described above on metadata, extracted images or on text. It doesn't do any filtering of markings (yet), but the readme describes which (two) methods to enhance to do add that.

The included readme:

ruamel.pdfdouble

this package provides the pdfdbl command:

pdfdbl scan dir1 dir2

This will walk down the directories provided as argument and for the PDF files found, create a hash based on (in order):

metadata if unique
images if the number of images
text

This assumes that pdfinfo, pdfimages and pdftotext` from the poppler-utils package are avaialable.

A "database" is build up in ~/.config/pdfdbl/pdf.lst against which further scans are tested.

Removing markings

In ruamel/pdfdouble/pdfdouble.py there are two methods that can be enhanced to filter out markings in the PDF that make them less unique and make virtually the same files to have different hashes.

For text the method PdfData.filter_for_marking should be extended to remove and markings from the string that is its arguments and return the result.

For scanned images the method PdfData.process_image_and_update needs to be enhanced, e.g. by cutting off the images bottom and top X lines, and by removing any gray background text by setting all black pixels to white. This function needs to update the hash passed in using the .update() method passing in the filtered data.

Restrictions

The current "database" cannot handle paths that contain newlines

This utility is currently Python 2.7 only.

IP conforming stringparts can be substituted with Python's re module:

import re
IPre = re.compile("(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
              "([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])")

x = IPre.sub(' ', 'abcd 132.234.0.2 ghi')
assert x == 'abcd   ghi'