I have a PDF document containing pages which have crop marks on them. I'd like to copy these pages to another PDF without the crop marks. I'm assuming I have to crop out the crop marks, but is there any way to do this in batch rather than interactively?
Windows – Batch crop PDF pages
crop, pdf, windows 7
Related Solutions
If your "identical" pages render into exactly the same visual appearance on screen, the following algorithmic approach could work to find out duplicates:
- Convert each page into a low-res TIFF or JPEG file using Ghostscript (f.e. using 72dpi).
- In case you use TIFF: run one of the libtiff commandline utilities to "normalize" the TIFF meta data.
- Run md5sum.exe on each TIFF or JPEG page and remember the Md5sum for each page.
- Sort the list of MD5sums to find the duplicate pages.
- Remember all duplicate page numbers to be deleted.
- Run a pdftk.exe command line on the original PDF to remove the duplicates.
You could code this algorithm in any language you like (even batch on Windows or bash on Linux/Unix/MacOSX).
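For example, here is a rough skeleton of such a batch script; it assumes gswin32c.exe, tiffset.exe, md5sum.exe and pdftk.exe are all on the PATH, and each command is explained step by step below:

@echo off
rem Skeleton only -- each step is discussed in detail further down.
rem Note: inside a .bat file the % signs must be doubled (%%06d, %%i).

rem 1. Rasterize every PDF page to a low-res TIFF.
gswin32c.exe -dBATCH -dNOPAUSE -dSAFER -sDEVICE=tiffg4 -r72x72 ^
 -sOutputFile=C:\temp\tiffs\page-%%06d.tif 12000pages.pdf

rem 2. Normalize the TIFF DateTime tag so identical pages hash identically.
for %%i in (C:\temp\tiffs\*.tif) do tiffset -s 306 "0000:00:00 00:00:00" %%i

rem 3. + 4. Hash every page and sort, so duplicate pages end up on adjacent lines.
md5sum.exe C:\temp\tiffs\*.tif | sort > hashes.txt

rem 5. Inspect hashes.txt and note the duplicate page numbers.
rem 6. Drop them from the original PDF with a pdftk "cat" range (see the last step).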
First: Some notes on using Ghostscript. Create your 12,000 TIFF (or JPEG) pages (on Linux you'd use gs instead of gswin32c):
gswin32c.exe ^
-dBATCH -dNOPAUSE -dSAFER ^
-sDEVICE=tiffg4 ^
-sOutputFile=C:\temp\tiffs\page-%06d.tif ^
-r72x72 ^
12000pages.pdf

Use -sDEVICE=jpeg to create *.jpeg files instead, and adapt -sOutputFile= accordingly. The pattern page-%06d.tif creates TIFFs named page-000001.tif through page-012000.tif.
Second: Some notes on why you need the (freely available) libtiff utilities. When Ghostscript creates a TIFF page, it writes its current version, the date and time, plus some other metadata into the TIFF. This would botch your MD5 checking, because otherwise identical TIFFs may carry different date/time stamps. Hence the need to "normalize" these. Use tiffinfo page-000001.tif or tiffdump page-000001.tif to see what I mean. You will see something like this:
c:\downloads> tiffdump.exe page-000001.tif
page-000001.tif:
Magic: 0x4949 <little-endian> Version: 0x2a
Directory 0: offset 2814 (0xafe) next 0 (0)
SubFileType (254) LONG (4) 1<2>
ImageWidth (256) SHORT (3) 1<595>
ImageLength (257) SHORT (3) 1<842>
BitsPerSample (258) SHORT (3) 1<1>
Compression (259) SHORT (3) 1<4>
Photometric (262) SHORT (3) 1<0>
FillOrder (266) SHORT (3) 1<1>
StripOffsets (273) LONG (4) 8<8 341 1979 1996 2013 2030 2047 2064>
Orientation (274) SHORT (3) 1<1>
SamplesPerPixel (277) SHORT (3) 1<1>
RowsPerStrip (278) SHORT (3) 1<109>
StripByteCounts (279) LONG (4) 8<333 1638 17 17 17 17 17 13>
XResolution (282) RATIONAL (5) 1<72>
YResolution (283) RATIONAL (5) 1<72>
PlanarConfig (284) SHORT (3) 1<1>
Group4Options (293) LONG (4) 1<0>
ResolutionUnit (296) SHORT (3) 1<2>
PageNumber (297) SHORT (3) 2<0 0>
Software (305) ASCII (2) 21<GPL Ghostscript 8.71\0>
DateTime (306) ASCII (2) 20<2010:06:22 04:56:12\0>
Here is the command to "normalize" the date+time fields (which are tagged "306" in my case) in an example TIFF:
c:\downloads> tiffset -s 306 "0000:00:00 00:00:00" ex001.tif
As a result, the DateTime field now has changed:
c:\pa>tiffdump ex001.tif | findstr DateTime
DateTime (306) ASCII (2) 20<0000:00:00 00:00:00\0>
Now loop through all your TIFFs to normalize all their DateTime fields:
c:\downloads> for %i in (C:\temp\tiffs\*.tif) ^
do tiffset -s 306 "0000:00:00 00:00:00" %i
Third and Fourth: Run md5sum.exe and sort the list of files to find duplicates. Here is a command line to use:
c:\downloads> md5sum.exe C:\temp\tiffs\*.tif | sort
As a result you should easily see which files/pages have the same MD5 hash. It will look similar to this:
c:\> md5sum.exe c:/temp/tiffs/page-0*.tif
[....]
fae9fa136c4f7ecca23b6a34d620fb02 *c:\temp\tiffs\page-000032.tif
fae9fa136c4f7ecca23b6a34d620fb02 *c:\temp\tiffs\page-000033.tif
fb5fef1732148d71bfff841c214cf836 *c:\temp\tiffs\page-000076.tif
fb5fef1732148d71bfff841c214cf836 *c:\temp\tiffs\page-000077.tif
fb86c1bdbc697eef7cb869f4e2e2957b *c:\temp\tiffs\page-000187.tif
fb86c1bdbc697eef7cb869f4e2e2957b *c:\temp\tiffs\page-000188.tif
fbb801ab3ef7ea33619132f97dcab045 *c:\temp\tiffs\page-000443.tif
fbb801ab3ef7ea33619132f97dcab045 *c:\temp\tiffs\page-000444.tif
fbc33cc0ff3e1252de1653ef2e978f94 *c:\temp\tiffs\page-000699.tif
fbc33cc0ff3e1252de1653ef2e978f94 *c:\temp\tiffs\page-000700.tif
fc3fd164e20bb707acddeabbc4e60f7e *c:\temp\tiffs\page-000899.tif
fc3fd164e20bb707acddeabbc4e60f7e *c:\temp\tiffs\page-000900.tif
[....]
I leave it to you to automate that step.
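If you want a starting point for automating it, a rough sketch in plain batch could look like this; it assumes you redirected the sorted md5sum output into a file named hashes.txt (a hypothetical name) and simply prints every page whose hash matches the one on the previous line:

@echo off
setlocal enabledelayedexpansion
rem List every file whose MD5 sum equals that of the previous (sorted) line.
set "prevhash="
for /f "tokens=1,2" %%a in (hashes.txt) do (
    if "%%a"=="!prevhash!" echo duplicate: %%b
    set "prevhash=%%a"
)
endlocal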
Fifth and Sixth: Delete all duplicate pages from your original PDF. Assume you now want to delete pages 33, 77, 188, 444, 700 and 900. Here is the pdftk.exe command to achieve this:
c: > pdftk.exe A=12000pages.pdf ^
cat A1-32 A34-76 A78-187 A189-443 A445-699 A701-899 A901-end ^
output nonduplicates.pdf
Edit: I don't know why I suggested TIFF at first; a smarter choice would be BMP.
If you use -sDEVICE=bmp256 and -sOutputFile=C:\temp\tiffs\page-%06d.bmp, you will not have to deal with the "normalization" step I outlined above. The rest of the procedure (md5sum ...) is the same.
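For reference, the adjusted Ghostscript call would then look roughly like this (same options and paths as before, only the device and the output extension change):

gswin32c.exe ^
 -dBATCH -dNOPAUSE -dSAFER ^
 -sDEVICE=bmp256 ^
 -r72x72 ^
 -sOutputFile=C:\temp\tiffs\page-%06d.bmp ^
 12000pages.pdf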
Best Answer
You can install ImageMagick and create a small batch script, example.bat. This script temporarily converts the PDF pages into a series of images, crops every image (specify your own width, height, x offset and y offset), and then reassembles the cropped pages into a new PDF file.
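The original script isn't reproduced here, but a minimal sketch of what example.bat could look like follows; the density, the crop geometry (500x700+50+50) and the input/output file names are placeholder assumptions you'd replace with your own values:

@echo off
rem example.bat -- minimal sketch, not the original script.
rem Requires ImageMagick (convert/mogrify) plus Ghostscript for PDF rasterization.

rem 1. Convert each PDF page into a temporary PNG image
rem    (the doubled %% is needed inside a .bat file).
convert -density 150 input.pdf page-%%03d.png

rem 2. Crop every page image: WIDTHxHEIGHT+XOFFSET+YOFFSET, then reset the canvas.
mogrify -crop 500x700+50+50 +repage page-*.png

rem 3. Reassemble the cropped images into a new PDF.
convert page-*.png output.pdf

rem 4. Clean up the temporary images.
del page-*.png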
For more information about ImageMagick: http://www.imagemagick.org/script/command-line-processing.php