Find duplicate PDF files by content

duplicate · imagemagick · pdf · scripting

Some journals generate a different PDF for each download. APS, for example, stores the download time and the IP address in the PDF.

Or there is one version of a paper with hyperlinks and another with plain text references.

How can I find duplicate downloads of papers with 90% identical content on a Linux system using open source software?

I have been thinking about converting the PDF files to plain text in a temporary directory with pdf2txt. Then I could filter out all pairs of files for which diff a b produces more than x lines of output. But this is not elegant at all and will fail with scanned publications: journals often do not provide OCR text for old publications.
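For what it is worth, a rough sketch of that pairwise approach (assuming pdftotext from poppler-utils instead of pdf2txt, plus Python's difflib; the file names and the threshold of 10 lines are placeholders):

import difflib
import subprocess

def pdf_text_lines(pdf_path):
    # "pdftotext file.pdf -" writes the extracted text to stdout
    return subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True).stdout.splitlines()

def differing_lines(pdf_a, pdf_b):
    # count added/removed lines, roughly what "diff a b | wc -l" would report
    diff = difflib.unified_diff(pdf_text_lines(pdf_a), pdf_text_lines(pdf_b), lineterm="")
    return sum(1 for line in diff
               if line.startswith(("+", "-")) and not line.startswith(("+++", "---")))

# treat the files as duplicates when fewer than 10 lines differ
print(differing_lines("paper_v1.pdf", "paper_v2.pdf") < 10)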

I also tried compare in the ImageMagick suite, but I could not handle multipage PDF files with this tool.

diffpdf 2.1.1 does a good job in a GUI on two files, but I could not figure out how to apply it to many files, and recent versions are not available under any open source license.

Best Answer

Since different publishers use different methods of "marking" the PDFs, you need to make sure you compare without taking the markings into account.

You also need an efficient method to compare a new PDF against all already downloaded PDFs, in case you repeatedly download the same PDF and it is e.g. marked with the IP and/or a date-time stamp as you suggest. You don't want to use a time-consuming comparison mechanism that compares each new PDF with many already downloaded PDFs.

What you need is a utility that strips each of the possible markings and generates a hash of the remaining data. You will need to keep a hash → file name map, which can be in a simple file; if a computed hash is already in that file you have a duplicate (and can delete it or do whatever is needed), and if the hash is not yet there, you add the hash and file name. The file would look something like:

6fcb6969835d2db7742e81267437c432  /home/anthon/Downloads/explanation.pdf
fa24fed8ca824976673a51803934d6b9  /home/anthon/orders/your_order_20150320.pdf

That file is negligibly small compared to the original PDFs. If you have millions of PDFs you might consider storing this data in a database. For efficiency's sake you might want to include the file size and the number of pages in there as well (pdfinfo file.pdf | grep -E '^Pages:' | grep -Eo '[0-9]*').
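A minimal sketch of maintaining such a map, assuming the stripped data is already available as bytes; the file location and function names below are placeholders, not part of any existing tool:

import hashlib
import os

DB = os.path.expanduser("~/pdf.lst")     # placeholder location for the hash -> file name map

def load_db():
    # each line holds "<hash>  <file name>", as in the example above
    seen = {}
    if os.path.exists(DB):
        with open(DB) as fp:
            for line in fp:
                digest, _, name = line.rstrip("\n").partition("  ")
                seen[digest] = name
    return seen

def register(pdf_path, stripped_data, seen):
    # stripped_data: the PDF content with markings already removed
    digest = hashlib.md5(stripped_data).hexdigest()
    if digest in seen:
        return seen[digest]              # duplicate: the file we already have
    seen[digest] = pdf_path
    with open(DB, "a") as fp:
        fp.write("{}  {}\n".format(digest, pdf_path))
    return None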


The above pushes the problem to removing the markings and generating the hash. If you know where the PDF comes from when invoking the hash-generating routine (i.e. if you do the downloads programmatically), you can fine-tune the hash generation based on that. But even without that there are several possibilities for hash generation:

  1. if the metadata for title and author is non-empty and does not include non-specific strings like "Acrobat" or "PDF", you could generate the hash based on just the author and title information. Use pdfinfo file.pdf | grep -E '^(Author:)|(Title:)' | md5sum to get the hash (see the sketch after this list). You can include the number of pages in calculating the hash as well ('Pages:' in the pdfinfo output).
  2. if the previous rule doesn't work and the PDF contains images, extract the images and generate a hash on the combined image data. If the images ever contain text in the footer or header like "Licensed to Joe User", strip an X number of lines from the top or bottom before calculating the hash. If that marking is in some big-lettered grayed background text this will of course not work, unless you filter out pixels that are not totally black (for that you could use ImageMagick). You can use pdfimages to extract the image information into a temporary file.
  3. if the previous rules don't work (because there are no images) you can use pdftotext to extract the text, filter out the marking (if you filter out a little too much, that is not a problem) and then generate the hash based on that.
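
As an illustration of the first rule, a hedged sketch of a metadata-based hash (pdfinfo is from poppler-utils; the check for non-specific strings like "Acrobat" is left out, and the function name is made up):

import hashlib
import subprocess

def metadata_hash(pdf_path):
    # roughly: pdfinfo file.pdf | grep -E '^(Author:)|(Title:)|(Pages:)' | md5sum
    info = subprocess.run(["pdfinfo", pdf_path],
                          capture_output=True, text=True).stdout
    wanted = [line for line in info.splitlines()
              if line.startswith(("Author:", "Title:", "Pages:"))]
    if not wanted:
        return None                      # fall back to the image or text based methods
    return hashlib.md5("\n".join(wanted).encode()).hexdigest()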

Additionally you can check whether the file size of the old file, found via the hash, is within certain margins of the size of the new file. Compression and differences in the embedded strings (IP/date-time stamp) should only result in less than a one percent difference.
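
For example, a tiny check along those lines (the one percent margin is simply the figure mentioned above, not a tuned value):

import os

def sizes_match(old_pdf, new_pdf, margin=0.01):
    # True when the two file sizes differ by less than the given fraction
    a, b = os.path.getsize(old_pdf), os.path.getsize(new_pdf)
    return abs(a - b) <= margin * max(a, b)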

If you know the method the publisher uses when determining the hash, you can directly apply the "right" method from the list above, but even without that you can check the metadata and apply some heuristics, or determine the number of images in a file and compare that with the number of pages (if they are close you probably have a document consisting of scans). pdftotext on scanned-image PDFs also has a recognisable output.
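
A sketch of that images-versus-pages heuristic, assuming the -list output of poppler's pdfimages (two header lines, then one line per image); the function name and the comparison rule are just illustrative:

import subprocess

def looks_scanned(pdf_path):
    info = subprocess.run(["pdfinfo", pdf_path],
                          capture_output=True, text=True).stdout
    n_pages = next(int(line.split()[1]) for line in info.splitlines()
                   if line.startswith("Pages:"))
    listing = subprocess.run(["pdfimages", "-list", pdf_path],
                             capture_output=True, text=True).stdout.splitlines()
    n_images = max(len(listing) - 2, 0)  # skip the two header lines
    return n_images >= n_pages           # roughly one image per page -> probably scans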


As a basis to work from, I created a Python package that is on Bitbucket and/or can be installed from PyPI using pip install ruamel.pdfdouble. This provides you with the pdfdbl command that does the scanning as described above on metadata, extracted images or on text. It doesn't do any filtering of markings (yet), but the readme describes which (two) methods to enhance to add that.

The included readme:

ruamel.pdfdouble

This package provides the pdfdbl command:

pdfdbl scan dir1 dir2

This will walk down the directories provided as arguments and, for the PDF files found, create a hash based on (in order):

  • metadata if unique
  • images, if the PDF contains any images
  • text

This assumes that pdfinfo, pdfimages and pdftotext from the poppler-utils package are available.

A "database" is built up in ~/.config/pdfdbl/pdf.lst against which further scans are tested.

Removing markings

In ruamel/pdfdouble/pdfdouble.py there are two methods that can be enhanced to filter out markings in the PDF that make them less unique and cause virtually identical files to have different hashes.

For text, the method PdfData.filter_for_marking should be extended to remove any markings from the string that is its argument and return the result.

For scanned images the method PdfData.process_image_and_update needs to be enhanced, e.g. by cutting off the bottom and top X lines of the images, and by removing any gray background text by setting all pixels that are not fully black to white. The function needs to update the hash passed in, using its .update() method, passing in the filtered data.

Restrictions

The current "database" cannot handle paths that contain newlines.

This utility is currently Python 2.7 only.


IP-address-like string parts can be substituted with Python's re module:

import re
# matches dotted-quad IPv4 addresses, each octet limited to 0-255
IPre = re.compile(r"(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
                  r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])")

x = IPre.sub(' ', 'abcd 132.234.0.2 ghi')
assert x == 'abcd   ghi'
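
In the same spirit, a rough sketch of the kind of image filtering PdfData.process_image_and_update could do, assuming the images extracted by pdfimages can be opened with Pillow; the strip height, the threshold and the helper name are made up, and this is not the package's actual code:

import hashlib
from PIL import Image                    # Pillow; not a dependency of ruamel.pdfdouble

def hash_filtered_image(image_path, hasher, strip_px=50, black_threshold=16):
    img = Image.open(image_path).convert("L")            # to grayscale
    w, h = img.size
    # cut off the top and bottom rows where "Licensed to ..." style markings tend to live
    img = img.crop((0, strip_px, w, h - strip_px))
    # keep only (nearly) black pixels; gray background text becomes white
    img = img.point(lambda p: 0 if p <= black_threshold else 255)
    hasher.update(img.tobytes())

hasher = hashlib.md5()
hash_filtered_image("image-000.pbm", hasher)             # pdfimages writes .pbm/.ppm files
print(hasher.hexdigest())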