Find duplicate PDF files by content

duplicate · imagemagick · pdf · scripting

Some journals generate a different PDF for each download. APS, for example, stores the download time and the IP address in the PDF.

Or there is one version of a paper with hyperlinks and another with plain text references.

How can I find duplicate downloads of papers with 90% identical content on a Linux system using open source software?

I have been thinking about converting the PDF files to plain text in a temporary directory with pdf2txt. Then I could filter out all pairs of files for which diff a b produces more than x lines of output. But this is not elegant at all and will fail with scanned publications: journals often do not provide OCR text for old publications.
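For what it is worth, a rough sketch of that pairwise approach (assuming pdftotext from poppler-utils instead of pdf2txt, plus Python's difflib; the file names and the threshold of 10 lines are placeholders):

import difflib
import subprocess

def pdf_text_lines(pdf_path):
    # "pdftotext file.pdf -" writes the extracted text to stdout
    return subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True).stdout.splitlines()

def differing_lines(pdf_a, pdf_b):
    # count added/removed lines, roughly what "diff a b | wc -l" would report
    diff = difflib.unified_diff(pdf_text_lines(pdf_a), pdf_text_lines(pdf_b), lineterm="")
    return sum(1 for line in diff
               if line.startswith(("+", "-")) and not line.startswith(("+++", "---")))

# treat the files as duplicates when fewer than 10 lines differ
print(differing_lines("paper_v1.pdf", "paper_v2.pdf") < 10)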

I also tried compare in the ImageMagick suite, but I could not handle multipage PDF files with this tool.

diffpdf 2.1.1 does a good job in a GUI on two files, but I could not figure out how to apply it to many files, and recent versions are not available under any open source license.

Best Answer

Since different publishers use different methods of "marking" the PDFs, you need to make sure you compare without taking the markings into account.

You also need an efficient method to compare a new PDF against all already downloaded PDFs, in case you repeatedly download the same PDF and it is e.g. marked with the IP and/or a date-time stamp as you suggest. You don't want to use a time-consuming comparison mechanism that compares each new PDF with many already downloaded PDFs.

What you need is a utility that strips each of the possible markings and generates a hash of the remaining data. You will need to keep a hash → file name map, which can be in a simple file; if a computed hash is already in that file you have a duplicate (and can delete it or do whatever is needed), and if the hash is not yet there, you add the hash and file name. The file would look something like:

6fcb6969835d2db7742e81267437c432  /home/anthon/Downloads/explanation.pdf
fa24fed8ca824976673a51803934d6b9  /home/anthon/orders/your_order_20150320.pdf

That file is negligibly small compared to the original PDFs. If you have millions of PDFs you might consider storing this data in a database. For efficiency's sake you might want to include the file size and the number of pages in there as well (pdfinfo file.pdf | grep -E '^Pages:' | grep -Eo '[0-9]*').
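A minimal sketch of maintaining such a map, assuming the stripped data is already available as bytes; the file location and function names below are placeholders, not part of any existing tool:

import hashlib
import os

DB = os.path.expanduser("~/pdf.lst")     # placeholder location for the hash -> file name map

def load_db():
    # each line holds "<hash>  <file name>", as in the example above
    seen = {}
    if os.path.exists(DB):
        with open(DB) as fp:
            for line in fp:
                digest, _, name = line.rstrip("\n").partition("  ")
                seen[digest] = name
    return seen

def register(pdf_path, stripped_data, seen):
    # stripped_data: the PDF content with markings already removed
    digest = hashlib.md5(stripped_data).hexdigest()
    if digest in seen:
        return seen[digest]              # duplicate: the file we already have
    seen[digest] = pdf_path
    with open(DB, "a") as fp:
        fp.write("{}  {}\n".format(digest, pdf_path))
    return None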


The above pushes the problem to removing the markings and generating the hash. If you know where the PDF comes from when invoking the hash-generating routine (i.e. if you do the downloads programmatically), you can fine-tune the hash generation based on that. But even without that there are several possibilities for hash generation:

  1. if the metadata for title and author is non-empty and does not include non-specific strings like "Acrobat" or "PDF", you could generate the hash based on just the author and title information. Use pdfinfo file.pdf | grep -E '^(Author:)|(Title:)' | md5sum to get the hash (see the sketch after this list). You can include the number of pages in calculating the hash as well ('Pages:' in the pdfinfo output).
  2. if the previous rule doesn't work and the PDF contains images, extract the images and generate a hash on the combined image data. If the images ever contain text in the footer or header like "Licensed to Joe User", strip an X number of lines from the top or bottom before calculating the hash. If that marking is in some big-lettered grayed background text this will of course not work, unless you filter out pixels that are not totally black (for that you could use ImageMagick). You can use pdfimages to extract the image information into a temporary file.
  3. if the previous rules don't work (because there are no images) you can use pdftotext to extract the text, filter out the marking (if you filter out a little too much, that is not a problem) and then generate the hash based on that.
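
As an illustration of the first rule, a hedged sketch of a metadata-based hash (pdfinfo is from poppler-utils; the check for non-specific strings like "Acrobat" is left out, and the function name is made up):

import hashlib
import subprocess

def metadata_hash(pdf_path):
    # roughly: pdfinfo file.pdf | grep -E '^(Author:)|(Title:)|(Pages:)' | md5sum
    info = subprocess.run(["pdfinfo", pdf_path],
                          capture_output=True, text=True).stdout
    wanted = [line for line in info.splitlines()
              if line.startswith(("Author:", "Title:", "Pages:"))]
    if not wanted:
        return None                      # fall back to the image or text based methods
    return hashlib.md5("\n".join(wanted).encode()).hexdigest()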

Additionally you can check whether the file size of the old file, found via the hash, is within certain margins of the size of the new file. Compression and differences in the embedded strings (IP/date-time stamp) should only result in less than a one percent difference.
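
For example, a tiny check along those lines (the one percent margin is simply the figure mentioned above, not a tuned value):

import os

def sizes_match(old_pdf, new_pdf, margin=0.01):
    # True when the two file sizes differ by less than the given fraction
    a, b = os.path.getsize(old_pdf), os.path.getsize(new_pdf)
    return abs(a - b) <= margin * max(a, b)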

If you know the method the publisher uses when determining the hash, you can directly apply the "right" method from the list above, but even without that you can check the metadata and apply some heuristics, or determine the number of images in a file and compare that with the number of pages (if they are close you probably have a document consisting of scans). pdftotext on scanned-image PDFs also has a recognisable output.
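
A sketch of that images-versus-pages heuristic, assuming the -list output of poppler's pdfimages (two header lines, then one line per image); the function name and the comparison rule are just illustrative:

import subprocess

def looks_scanned(pdf_path):
    info = subprocess.run(["pdfinfo", pdf_path],
                          capture_output=True, text=True).stdout
    n_pages = next(int(line.split()[1]) for line in info.splitlines()
                   if line.startswith("Pages:"))
    listing = subprocess.run(["pdfimages", "-list", pdf_path],
                             capture_output=True, text=True).stdout.splitlines()
    n_images = max(len(listing) - 2, 0)  # skip the two header lines
    return n_images >= n_pages           # roughly one image per page -> probably scans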


As a basis to work from, I created a Python package that is on Bitbucket and/or can be installed from PyPI using pip install ruamel.pdfdouble. This provides you with the pdfdbl command that does the scanning as described above on metadata, extracted images or on text. It doesn't do any filtering of markings (yet), but the readme describes which (two) methods to enhance to add that.

The included readme:

ruamel.pdfdouble

This package provides the pdfdbl command:

pdfdbl scan dir1 dir2

This will walk down the directories provided as arguments and, for the PDF files found, create a hash based on (in order):

  • metadata if unique
  • images, if the PDF contains any images
  • text

This assumes that pdfinfo, pdfimages and pdftotext from the poppler-utils package are available.

A "database" is built up in ~/.config/pdfdbl/pdf.lst against which further scans are tested.

Removing markings

In ruamel/pdfdouble/pdfdouble.py there are two methods that can be enhanced to filter out markings in the PDF that make them less unique and cause virtually identical files to have different hashes.

For text, the method PdfData.filter_for_marking should be extended to remove any markings from the string that is its argument and return the result.

For scanned images the method PdfData.process_image_and_update needs to be enhanced, e.g. by cutting off the bottom and top X lines of the images, and by removing any gray background text by setting all pixels that are not fully black to white. The function needs to update the hash passed in, using its .update() method, passing in the filtered data.

Restrictions

The current "database" cannot handle paths that contain newlines.

This utility is currently Python 2.7 only.


IP-address-like string parts can be substituted with Python's re module:

import re
# matches dotted-quad IPv4 addresses, each octet limited to 0-255
IPre = re.compile(r"(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
                  r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])")

x = IPre.sub(' ', 'abcd 132.234.0.2 ghi')
assert x == 'abcd   ghi'
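
In the same spirit, a rough sketch of the kind of image filtering PdfData.process_image_and_update could do, assuming the images extracted by pdfimages can be opened with Pillow; the strip height, the threshold and the helper name are made up, and this is not the package's actual code:

import hashlib
from PIL import Image                    # Pillow; not a dependency of ruamel.pdfdouble

def hash_filtered_image(image_path, hasher, strip_px=50, black_threshold=16):
    img = Image.open(image_path).convert("L")            # to grayscale
    w, h = img.size
    # cut off the top and bottom rows where "Licensed to ..." style markings tend to live
    img = img.crop((0, strip_px, w, h - strip_px))
    # keep only (nearly) black pixels; gray background text becomes white
    img = img.point(lambda p: 0 if p <= black_threshold else 255)
    hasher.update(img.tobytes())

hasher = hashlib.md5()
hash_filtered_image("image-000.pbm", hasher)             # pdfimages writes .pbm/.ppm files
print(hasher.hexdigest())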