First: is there a reason you need to use symlinks and not the usual hard links? I am having a hard time understanding the need for symlinks with relative paths. Here is how I would solve this problem:
I think the Debian (Ubuntu) version of fdupes can replace duplicates with hard links using the -L option, but I don't have a Debian installation to verify this.
If you do not have a version with the -L option, you can use this tiny bash script I found on commandlinefu. Note that this syntax only works in bash.
fdupes -r -1 path | while read line
do
  master=""
  for file in ${line[*]}
  do
    if [ "x${master}" == "x" ]
    then
      master="$file"
    else
      ln -f "${master}" "${file}"
    fi
  done
done
The above command will find all duplicate files in "path" and replace them with hard links. You can verify this by running ls -ilR and looking at the inode number. Here is a sample with ten identical files:
$ ls -ilR
.:
total 20
3094308 -rw------- 1 username group 5 Sep 14 17:21 file
3094311 -rw------- 1 username group 5 Sep 14 17:21 file2
3094312 -rw------- 1 username group 5 Sep 14 17:21 file3
3094313 -rw------- 1 username group 5 Sep 14 17:21 file4
3094314 -rw------- 1 username group 5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:22 subdirectory
./subdirectory:
total 20
3094316 -rw------- 1 username group 5 Sep 14 17:22 file
3094332 -rw------- 1 username group 5 Sep 14 17:22 file2
3094345 -rw------- 1 username group 5 Sep 14 17:22 file3
3094346 -rw------- 1 username group 5 Sep 14 17:22 file4
3094347 -rw------- 1 username group 5 Sep 14 17:22 file5
All the files have separate inode numbers, making them separate files.
Now let's deduplicate them:
$ fdupes -r -1 . | while read line; do master=""; for file in ${line[*]}; do if [ "x${master}" == "x" ]; then master="$file"; else ln -f "${master}" "${file}"; fi; done; done
$ ls -ilR
.:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:24 subdirectory
./subdirectory:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5
The files now all have the same inode number, meaning they all point to the same
physical data on disk.
I hope this solves your problem or at least points you in the right direction!
Since different publishers use different methods of "marking" the PDFs, you need to make sure you compare without taking the markings into account.
You also need an efficient method to compare a new PDF to all already downloaded PDFs, in case you repeatedly download the same PDF and it is e.g. marked with the IP and/or a date-time-stamp as you suggest. You don't want to use a time-consuming comparison mechanism that compares each new PDF with many already downloaded PDFs.
What you need is a utility that strips each of the possible markings and generates a hash of the remaining data. You will need to keep a hash → file name map, which can be in a simple file; if a computed hash is already in the file you have a duplicate (and delete it or do whatever is needed), and if the hash is not yet there, you add the hash and file name. The file would look something like:
6fcb6969835d2db7742e81267437c432 /home/anthon/Downloads/explanation.pdf
fa24fed8ca824976673a51803934d6b9 /home/anthon/orders/your_order_20150320.pdf
That file is negligibly small compared to the original PDFs. If you have millions of PDFs you might consider storing this data in a database. For efficiency's sake you might want to include the file size and number of pages in there (pdfinfo file.pdf | grep -E '^Pages:' | grep -Eo '[0-9]*').
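The bookkeeping against that hash → file name map can be sketched in Python. The function name below is a placeholder, and how the hash itself is computed is deliberately left out:

```python
import os

def check_and_record(file_hash, filename, dbpath):
    """Look up `file_hash` in the hash -> file-name map stored at `dbpath`.

    Returns the already-recorded file name if the hash is known
    (i.e. `filename` is a duplicate); otherwise records the pair
    and returns None.  One "hash filename" entry per line, mirroring
    the example map above.
    """
    if os.path.exists(dbpath):
        with open(dbpath) as f:
            for line in f:
                known_hash, _, known_name = line.rstrip('\n').partition(' ')
                if known_hash == file_hash:
                    return known_name
    with open(dbpath, 'a') as f:
        f.write('{} {}\n'.format(file_hash, filename))
    return None
```

A linear scan of the map is fine for thousands of entries; for millions, this is where a real database (or at least an in-memory dict loaded once) pays off.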
The above pushes the problem to removing the markings and generating the hash. If you know where the PDF comes from when invoking the hash generating routine (i.e. if you do the downloads programmatically), you can fine-tune the hash generation based on that. But even without that there are several possibilities for hash generation:
- if the metadata for title and author is non-empty and does not include non-specific strings like "Acrobat" or "PDF" you could generate the hash based on just the author and title information. Use
pdfinfo file.pdf | grep -E '^(Author|Title):' | md5sum
to get the hash. You can include the number of pages in calculating the hash as well ('Pages:' in the pdfinfo output).
- if the previous rule doesn't work and the PDF contains images, extract the images and generate a hash on the combined image data. If the images ever contain text in the footer or header like "Licensed to Joe User", strip an X number of lines from the top or bottom before calculating the hash. If that marking is in some big-lettered grayed background text this will of course not work, unless you filter out pixels that are not totally black (for that you could use imagemagick). You can use pdfimages to extract the image information into a temporary file.
- if the previous rules don't work (because there are no images) you can use pdftotext to extract the text, filter out the marking (if you filter out a little too much, that is not a problem) and then generate the hash based on that.
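The first rule above (hashing the Author:/Title: metadata) can be sketched in Python. The helper below works on pdfinfo-style output passed in as a string; running pdfinfo itself via subprocess is left out, and the function name is a placeholder:

```python
import hashlib
import re

def metadata_hash(pdfinfo_output, pages=None):
    """Hash the Author:/Title: lines of `pdfinfo` output.

    Returns None when author or title is missing or contains
    non-specific strings like "Acrobat" or "PDF", so the caller
    can fall back to image- or text-based hashing.
    """
    fields = {}
    for line in pdfinfo_output.splitlines():
        m = re.match(r'^(Author|Title):\s*(.*)$', line)
        if m:
            fields[m.group(1)] = m.group(2).strip()
    author = fields.get('Author', '')
    title = fields.get('Title', '')
    if not author or not title:
        return None
    if any(bad in v for v in (author, title) for bad in ('Acrobat', 'PDF')):
        return None
    h = hashlib.md5()
    h.update(author.encode('utf-8'))
    h.update(title.encode('utf-8'))
    if pages is not None:                 # optionally mix in the page count
        h.update(str(pages).encode('utf-8'))
    return h.hexdigest()
```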
Additionally you can compare the file size of the old file found via the hash with that of the new file and check that it is within certain margins. Compression and differences in strings (IP/date-time-stamp) should only result in less than one percent difference.
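That margin check is a one-liner; a sketch, with the one-percent default mirroring the estimate above:

```python
def within_margin(old_size, new_size, margin=0.01):
    """True when the two file sizes differ by at most `margin`
    (default one percent) of the larger size."""
    return abs(old_size - new_size) <= margin * max(old_size, new_size)
```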
If you know the method the publisher uses when determining the hash, you can directly apply the "right" method of the above, but even without that you can check the metadata and apply some heuristics, or determine the number of images in a file and compare that with the number of pages (if they are close you probably have a document consisting of scans). pdftotext on scanned-image PDFs also has a recognisable output.
As a basis to work from I created a Python package that is on bitbucket and/or can be installed from PyPI using pip install ruamel.pdfdouble.
This provides you with the pdfdbl command that does the scanning as described above on metadata, extracted images or on text.
It doesn't do any filtering of markings (yet), but the readme describes which (two) methods to enhance to add that.
The included readme:
ruamel.pdfdouble
this package provides the pdfdbl
command:
pdfdbl scan dir1 dir2
This will walk down the directories provided as argument and for the PDF files found, create a hash based on (in order):
- metadata if unique
- images if the number of images is greater than zero
- text
This assumes that pdfinfo, pdfimages and pdftotext (from the poppler-utils package) are available.
A "database" is built up in ~/.config/pdfdbl/pdf.lst against which further scans are tested.
Removing markings
In ruamel/pdfdouble/pdfdouble.py
there are two methods that can be enhanced to filter out markings in the PDF that make them less unique and cause virtually identical files to have different hashes.
For text the method PdfData.filter_for_marking
should be extended to remove any markings from the string that is its argument and return the result.
For scanned images the method PdfData.process_image_and_update
needs to be enhanced, e.g. by cutting off the image's bottom and top X lines, and by removing any gray background text by setting all non-black pixels to white. This function needs to update the hash passed in, using its .update()
method, passing in the filtered data.
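What such an enhancement might look like, as a rough sketch: the row-list representation of a grayscale image, the black threshold, and the function name below are assumptions for illustration, not the package's actual interface.

```python
import hashlib

def filter_rows_and_update(rows, hasher, strip=2, black_threshold=16):
    """Strip the top and bottom `strip` rows of a grayscale image
    (given as a list of bytes objects, one per row), set every pixel
    that is not near-black to white, and feed the result to
    hasher.update().  This makes header/footer text and gray
    background watermarks disappear from the hash."""
    kept = rows[strip:len(rows) - strip] if strip else rows
    for row in kept:
        filtered = bytes(0 if p <= black_threshold else 255 for p in row)
        hasher.update(filtered)

# Two tiny 8x6 "scans" that differ only in a gray watermark row and a
# black footer row ("Licensed to ..."):
clean = [bytes([255] * 8) for _ in range(6)]
marked = [bytes([255] * 8) for _ in range(6)]
marked[3] = bytes([128] * 8)   # gray background text
marked[5] = bytes([0] * 8)     # footer marking
```

With stripping and whitening applied, both images hash identically; without them they do not.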
Restrictions
The current "database" cannot handle paths that contain newlines.
This utility is currently Python 2.7 only.
IP-conforming string parts can be substituted with Python's re module:
import re
IPre = re.compile(r"(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
                  r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])")
x = IPre.sub('', 'abcd 132.234.0.2 ghi')
assert x == 'abcd  ghi'
Best Answer
In my experience fdupes can be inconsistent in the order that it outputs files (I have had my own problems using the --delete option). The following should be fairly robust, as it doesn't require the files to be in a specific order (as long as there are always two dupes in different folders). It will just print out the mv commands; remove the echo when you are sure you have what you want. Also, the -i option for mv will prompt you if it is going to overwrite anything.
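The script itself is not reproduced above; a minimal sketch of the approach it describes (keep the first file of each fdupes group, print an mv -i command for every other one) could look like this — the dupes/ holding directory is a placeholder, and it shares the usual caveat that filenames with spaces will break the word-splitting:

```shell
#!/bin/bash
# Reads "fdupes -r -1 path"-style output (one space-separated group of
# duplicate paths per line) on stdin; keeps the first file of each group
# and prints an "mv -i" command for each remaining duplicate.
print_moves() {
    while read -r line; do
        first=""
        for file in $line; do
            if [ -z "$first" ]; then
                first=$file
            else
                echo mv -i "$file" dupes/
            fi
        done
    done
}
# usage: fdupes -r -1 path | print_moves
```

Drop the echo (and mkdir the holding directory first) once the printed commands look right.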