I have a PDF document that was created by concatenating a huge number of little documents, for example 1.pdf, 2.pdf and 3.pdf. The problem is that the last page of 1.pdf is also the first page of 2.pdf, and the last page of 2.pdf is also the first … you get the idea.
So, after joining, I got a PDF document with a lot of duplicate pages. And the document has about 12000 pages (!!). Is there a way to automatically detect duplicate pages and remove them?
Or any ideas how to make this a little easier?
Best Answer
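If your "identical" pages render to exactly the same visual appearance on screen, the following algorithmic approach could work to find the duplicates:

1. Render each page of the PDF to a TIFF (or JPEG) image with Ghostscript.
2. If you use TIFF, "normalize" the metadata of each image with the libtiff utilities.
3. Run `md5sum.exe` on each page image.
4. Sort the list of MD5 hashes to find the duplicate pages.
5. Note the page numbers of all duplicates.
6. Run a `pdftk.exe` commandline on the original PDF to remove the duplicates.

You could code this algorithm in any language you like (even batch on Windows or bash on Linux/Unix/Mac OS X).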
First: Some notes on using Ghostscript. Create your 12000 TIFF (or JPEG) pages (on Linux you'd use `gs` instead of `gswin32c`):

Second: Some notes on the requirement of using the (freely available) libtiff utilities. When Ghostscript creates a TIFF page, it records its current version, the date and time, plus some other metadata inside the TIFF. This could botch your MD5 checking, because otherwise identical TIFFs may carry different date/time stamps. Hence the need to "normalize" these. Use `tiffinfo page-000001.tif` or `tiffdump page-000001.tif` to see what I mean. You could see something like this:

Here is the command to "normalize" the date+time fields (which are tagged "306" in my case) in an example TIFF:
As a result, the DateTime field has now changed:
Now loop through all your TIFFs to normalize all their DateTime fields:
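A minimal sketch of such a loop in Python, for anyone who would rather script it than use batch or bash. It assumes libtiff's `tiffset` is on the PATH and that the pages are named like `page-000001.tif`, as in the text. The function names and the constant timestamp are placeholders, not part of the original answer; any fixed value should do as long as every file gets the same one.

```python
import subprocess
from pathlib import Path

DATETIME_TAG = "306"                 # TIFF DateTime tag, as mentioned above
FIXED_STAMP = "0000:00:00 00:00:00"  # assumption: any constant stamp works


def tiffset_command(tiff_path, stamp=FIXED_STAMP):
    """Build the tiffset call that overwrites the DateTime field of one file."""
    return ["tiffset", "-s", DATETIME_TAG, stamp, str(tiff_path)]


def normalize_all(tiff_dir, pattern="page-*.tif"):
    """Overwrite the DateTime field of every matching TIFF in place.

    Returns the number of files processed. Raises CalledProcessError
    if tiffset fails on any file.
    """
    count = 0
    for tiff in sorted(Path(tiff_dir).glob(pattern)):
        subprocess.run(tiffset_command(tiff), check=True)
        count += 1
    return count
```

Calling `normalize_all(r"C:\temp\tiffs")` would then touch every page image once. Since the files are modified in place, it may be worth trying it on a copy of a few pages first.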
Third and Fourth: Run md5sum.exe and sort the list of files to find duplicates. Here is a commandline to use:
As a result you should easily see which files/pages have the same MD5 hash. It will look similar to this:
I leave it to you to automate that step.
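One possible way to automate it, sketched in Python with `hashlib` standing in for `md5sum.exe` and a dictionary standing in for the sort. It assumes the page images are named like `page-000001.tif` so the page number can be read from the file name; the function names are made up for this example. The first page with a given hash is kept and every later page with the same hash is reported as a duplicate.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_pages(page_digests):
    """Given (page_number, digest) pairs, return the pages to delete.

    Pages are visited in ascending order; a page is a duplicate if an
    earlier page already produced the same digest.
    """
    first_seen = {}
    duplicates = []
    for page, digest in sorted(page_digests):
        if digest in first_seen:
            duplicates.append(page)
        else:
            first_seen[digest] = page
    return duplicates


def scan(image_dir, pattern="page-*.tif"):
    """Hash every page image in image_dir and return the duplicate page numbers."""
    pairs = []
    for path in Path(image_dir).glob(pattern):
        page = int(path.stem.rsplit("-", 1)[-1])  # "page-000033" -> 33
        pairs.append((page, md5_of(path)))
    return duplicate_pages(pairs)
```

Note that this flags every repeated image, not only neighbouring ones. If the document contains pages that legitimately repeat (blank pages, for instance), it is probably safer to review the list, or to compare each page only with the one before it.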
Fifth and Sixth: Delete all duplicate pages from your original PDF. Assume you now want to delete pages 33, 77, 188, 444, 700 and 900. Here is the `pdftk.exe` command to achieve this:

*Edit: Don't know why I suggested TIFF at first; it would be more intelligent to use BMP.*
If you use `-sDEVICE=bmp256` and `-sOutputFile=C:\temp\tiffs\page-%06d.bmp` you will not have to deal with the 'normalisation' step I outlined above. The rest of the procedure (`md5sum ...`) is the same.
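To make the two ends of the procedure concrete, here is a rough Python sketch that only builds the command lines and leaves running them to you: a Ghostscript call using the BMP device from the edit, and a pdftk call for the deletion step. pdftk's `cat` operation takes the page ranges to keep, so the list of pages to delete has to be inverted first. The function names, the 72 dpi resolution and the file names are assumptions for illustration; the Ghostscript executable name depends on your platform (`gs` or `gswin32c`, as noted above).

```python
def ghostscript_bmp_command(pdf_path, out_pattern, gs="gs", dpi=72):
    """Build a Ghostscript call that renders each page to its own BMP file.

    out_pattern should contain a %d-style counter, e.g. "page-%06d.bmp".
    The low resolution is an assumption to keep 12000 images small.
    """
    return [
        gs, "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=bmp256", f"-r{dpi}",
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def keep_ranges(delete_pages, total_pages):
    """Return pdftk-style ranges covering every page NOT in delete_pages."""
    ranges = []
    start = 1
    for page in sorted(set(delete_pages)):
        if page < 1 or page > total_pages:
            raise ValueError(f"page {page} is outside 1..{total_pages}")
        if page > start:
            ranges.append((start, page - 1))
        start = page + 1
    if start <= total_pages:
        ranges.append((start, total_pages))
    return [str(a) if a == b else f"{a}-{b}" for a, b in ranges]


def pdftk_delete_command(src, dst, delete_pages, total_pages, pdftk="pdftk"):
    """Build a pdftk call that writes src without delete_pages to dst."""
    ranges = keep_ranges(delete_pages, total_pages)
    if not ranges:
        raise ValueError("every page would be deleted")
    return [pdftk, src, "cat", *ranges, "output", dst]
```

For the example pages above and a 12000-page document, `keep_ranges([33, 77, 188, 444, 700, 900], 12000)` yields `1-32 34-76 78-187 189-443 445-699 701-899 901-12000`. Either list can be passed to `subprocess.run(..., check=True)` once the tools are installed.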