I have a PDF document containing pages which have crop marks on them. I'd like to copy these pages to another PDF without the crop marks. I'm assuming I have to crop out the crop marks, but is there any way to do this in batch rather than interactively?
Windows – Batch crop PDF pages
crop, pdf, windows 7
Related Solutions
If your "identical" pages render into exactly the same visual appearance on screen, the following algorithmic approach could work to find out duplicates:
- Convert each page into a low-res TIFF or JPEG file using Ghostscript (f.e. using 72dpi).
- In case you use TIFF: run one of the libtiff commandline utilities to "normalize" the TIFF meta data.
- Run md5sum.exe on each TIFF or JPEG page and remember the Md5sum for each page.
- Sort the list of MD5sums to find the duplicate pages.
- Remember all duplicate page numbers to be deleted.
- Run a pdftk.exe command line on the original PDF to remove the duplicates.
You could code this algorithm in any language you like (even batch on Windows or bash on Linux/Unix/MacOSX).
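For example, here is a rough skeleton of such a batch script; it assumes gswin32c.exe, tiffset.exe, md5sum.exe and pdftk.exe are all on the PATH, and each command is explained step by step below:

@echo off
rem Skeleton only -- each step is discussed in detail further down.
rem Note: inside a .bat file the % signs must be doubled (%%06d, %%i).

rem 1. Rasterize every PDF page to a low-res TIFF.
gswin32c.exe -dBATCH -dNOPAUSE -dSAFER -sDEVICE=tiffg4 -r72x72 ^
 -sOutputFile=C:\temp\tiffs\page-%%06d.tif 12000pages.pdf

rem 2. Normalize the TIFF DateTime tag so identical pages hash identically.
for %%i in (C:\temp\tiffs\*.tif) do tiffset -s 306 "0000:00:00 00:00:00" %%i

rem 3. + 4. Hash every page and sort, so duplicate pages end up on adjacent lines.
md5sum.exe C:\temp\tiffs\*.tif | sort > hashes.txt

rem 5. Inspect hashes.txt and note the duplicate page numbers.
rem 6. Drop them from the original PDF with a pdftk "cat" range (see the last step).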
First: Some notes on using Ghostscript. Create your 12,000 TIFF (or JPEG) pages (on Linux you'd use gs instead of gswin32c):
gswin32c.exe ^
-dBATCH -dNOPAUSE -dSAFER ^
-sDEVICE=tiffg4 ^
-sOutputFile=C:\temp\tiffs\page-%06d.tif ^
-r72x72 ^
12000pages.pdf

Use -sDEVICE=jpeg to create *.jpeg files instead, and adapt -sOutputFile= accordingly. The pattern page-%06d.tif creates TIFFs named page-000001.tif through page-012000.tif.
Second: Some notes on why you need the (freely available) libtiff utilities. When Ghostscript creates a TIFF page, it writes its current version, the date and time, plus some other metadata into the TIFF. This would botch your MD5 checking, because otherwise identical TIFFs may carry different date/time stamps. Hence the need to "normalize" these. Use tiffinfo page-000001.tif or tiffdump page-000001.tif to see what I mean. You will see something like this:
c:\downloads> tiffdump.exe page-000001.tif
page-000001.tif:
Magic: 0x4949 <little-endian> Version: 0x2a
Directory 0: offset 2814 (0xafe) next 0 (0)
SubFileType (254) LONG (4) 1<2>
ImageWidth (256) SHORT (3) 1<595>
ImageLength (257) SHORT (3) 1<842>
BitsPerSample (258) SHORT (3) 1<1>
Compression (259) SHORT (3) 1<4>
Photometric (262) SHORT (3) 1<0>
FillOrder (266) SHORT (3) 1<1>
StripOffsets (273) LONG (4) 8<8 341 1979 1996 2013 2030 2047 2064>
Orientation (274) SHORT (3) 1<1>
SamplesPerPixel (277) SHORT (3) 1<1>
RowsPerStrip (278) SHORT (3) 1<109>
StripByteCounts (279) LONG (4) 8<333 1638 17 17 17 17 17 13>
XResolution (282) RATIONAL (5) 1<72>
YResolution (283) RATIONAL (5) 1<72>
PlanarConfig (284) SHORT (3) 1<1>
Group4Options (293) LONG (4) 1<0>
ResolutionUnit (296) SHORT (3) 1<2>
PageNumber (297) SHORT (3) 2<0 0>
Software (305) ASCII (2) 21<GPL Ghostscript 8.71\0>
DateTime (306) ASCII (2) 20<2010:06:22 04:56:12\0>
Here is the command to "normalize" the date+time fields (which are tagged "306" in my case) in an example TIFF:
c:\downloads> tiffset -s 306 "0000:00:00 00:00:00" ex001.tif
As a result, the DateTime field now has changed:
c:\pa>tiffdump ex001.tif | findstr DateTime
DateTime (306) ASCII (2) 20<0000:00:00 00:00:00\0>
Now loop through all your TIFFs to normalize all their DateTime fields:
c:\downloads> for %i in (C:\temp\tiffs\*.tif) ^
do tiffset -s 306 "0000:00:00 00:00:00" %i
Third and Fourth: Run md5sum.exe and sort the list of files to find duplicates. Here is a command line to use:
c:\downloads> md5sum.exe C:\temp\tiffs\*.tif | sort
As a result you should easily see which files/pages have the same MD5 hash. It will look similar to this:
c:\> md5sum.exe c:/temp/tiffs/page-0*.tif
[....]
fae9fa136c4f7ecca23b6a34d620fb02 *c:\temp\tiffs\page-000032.tif
fae9fa136c4f7ecca23b6a34d620fb02 *c:\temp\tiffs\page-000033.tif
fb5fef1732148d71bfff841c214cf836 *c:\temp\tiffs\page-000076.tif
fb5fef1732148d71bfff841c214cf836 *c:\temp\tiffs\page-000077.tif
fb86c1bdbc697eef7cb869f4e2e2957b *c:\temp\tiffs\page-000187.tif
fb86c1bdbc697eef7cb869f4e2e2957b *c:\temp\tiffs\page-000188.tif
fbb801ab3ef7ea33619132f97dcab045 *c:\temp\tiffs\page-000443.tif
fbb801ab3ef7ea33619132f97dcab045 *c:\temp\tiffs\page-000444.tif
fbc33cc0ff3e1252de1653ef2e978f94 *c:\temp\tiffs\page-000699.tif
fbc33cc0ff3e1252de1653ef2e978f94 *c:\temp\tiffs\page-000700.tif
fc3fd164e20bb707acddeabbc4e60f7e *c:\temp\tiffs\page-000899.tif
fc3fd164e20bb707acddeabbc4e60f7e *c:\temp\tiffs\page-000900.tif
[....]
I leave it to you to automate that step.
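If you want a starting point for automating it, a rough sketch in plain batch could look like this; it assumes you redirected the sorted md5sum output into a file named hashes.txt (a hypothetical name) and simply prints every page whose hash matches the one on the previous line:

@echo off
setlocal enabledelayedexpansion
rem List every file whose MD5 sum equals that of the previous (sorted) line.
set "prevhash="
for /f "tokens=1,2" %%a in (hashes.txt) do (
    if "%%a"=="!prevhash!" echo duplicate: %%b
    set "prevhash=%%a"
)
endlocal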
Fifth and Sixth: Delete all duplicate pages from your original PDF. Assume you now want to delete pages 33, 77, 188, 444, 700 and 900. Here is the pdftk.exe command to achieve this:
c: > pdftk.exe A=12000pages.pdf ^
cat A1-32 A34-76 A78-187 A189-443 A445-699 A701-899 A901-end ^
output nonduplicates.pdf
Edit: I don't know why I suggested TIFF at first; a smarter choice would be BMP.
If you use -sDEVICE=bmp256 and -sOutputFile=C:\temp\tiffs\page-%06d.bmp, you will not have to deal with the "normalization" step I outlined above. The rest of the procedure (md5sum ...) is the same.
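For reference, the adjusted Ghostscript call would then look roughly like this (same options and paths as before, only the device and the output extension change):

gswin32c.exe ^
 -dBATCH -dNOPAUSE -dSAFER ^
 -sDEVICE=bmp256 ^
 -r72x72 ^
 -sOutputFile=C:\temp\tiffs\page-%06d.bmp ^
 12000pages.pdf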
Best Answer
You can install ImageMagick and create a small batch script, example.bat. This script temporarily converts the PDF pages into a series of images, crops every image (specify your own width, height, x offset and y offset), and then reassembles the cropped pages into a new PDF file.
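The original script isn't reproduced here, but a minimal sketch of what example.bat could look like follows; the density, the crop geometry (500x700+50+50) and the input/output file names are placeholder assumptions you'd replace with your own values:

@echo off
rem example.bat -- minimal sketch, not the original script.
rem Requires ImageMagick (convert/mogrify) plus Ghostscript for PDF rasterization.

rem 1. Convert each PDF page into a temporary PNG image
rem    (the doubled %% is needed inside a .bat file).
convert -density 150 input.pdf page-%%03d.png

rem 2. Crop every page image: WIDTHxHEIGHT+XOFFSET+YOFFSET, then reset the canvas.
mogrify -crop 500x700+50+50 +repage page-*.png

rem 3. Reassemble the cropped images into a new PDF.
convert page-*.png output.pdf

rem 4. Clean up the temporary images.
del page-*.png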
For more information about ImageMagick: http://www.imagemagick.org/script/command-line-processing.php