slhck's answer and scruss' comment deserve an update: pdfimages now (at least since version 0.26.5) explicitly lists x-ppi and y-ppi. Here is a sample output:
$ pdfimages -list example.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2244  2244  cmyk    4   8  image  no       215  0   301   301  418K 2.1%
   2     1 image     900   600  rgb     3   8  image  no       324  0  1524  1525 35.5K 2.2%
On Debian (Wheezy) and Fedora (23), pdfimages is part of the poppler-utils package.
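If the x-ppi and y-ppi values are needed in a script, the listing can be split on whitespace. The following is only a minimal sketch: the function name parse_pdfimages_list is hypothetical, and it assumes the column layout matches the sample above (in particular, that "object ID" is two whitespace-separated fields in both the header and the rows).

```python
# Sample text copied from the listing above (without the shell prompt line).
SAMPLE = """\
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 2244 2244 cmyk 4 8 image no 215 0 301 301 418K 2.1%
2 1 image 900 600 rgb 3 8 image no 324 0 1524 1525 35.5K 2.2%
"""


def parse_pdfimages_list(text):
    """Return one dict per image row, keyed by the header's column names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        # Skip the dashed separator and any row that does not match the header.
        if len(fields) != len(header):
            continue
        rows.append(dict(zip(header, fields)))
    return rows


rows = parse_pdfimages_list(SAMPLE)
for row in rows:
    print(row["page"], row["x-ppi"], row["y-ppi"])
```

All values are kept as strings; convert them with int() where arithmetic is needed.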
I modified mupdf's pdfdraw to support drawing in best-fit mode, so I could state that the output must be at most 128x128 and it would fit the output in that box while maintaining the aspect ratio. Before I did that, the only way was to use pdfinfo to get the page size, do the calculations to fit it in a box, and then ask pdfdraw to draw it with the resulting scale factor (dots per inch).
Well, after that long story, the process is rather simple:
- get the size of the page to render (in PDF terms, the media box). This can be done via pdfinfo and grep, which report it in pts (points, 1/72 of an inch), or via a PDF library like pyPdf:
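import pyPdf
# open() replaces the Python 2-only file() builtin
p = pyPdf.PdfFileReader(open("/home/dan/Desktop/Sieve-JFP.pdf", "rb"))
# /MediaBox holds two corners (x0, y0, x1, y1) in points, not a width and height
x0, y0, x1, y1 = [float(v) for v in p.pages[0]['/MediaBox']]
w, h = x1 - x0, y1 - y0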
- for a box fit, compute dpi = min( A/(w/72.), B/(h/72.) ), where A is the maximum width and B is the maximum height (both in pixels), and w and h are the width and height of the page in points.
- pass dpi to convert -density $dpi
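The steps above can be sketched as a small function. This is only an illustration of the formula: the name box_fit_dpi is hypothetical, and the 612x792 pt page (US Letter) and the 128x128 box are example values, not taken from any particular file.

```python
def box_fit_dpi(page_w_pt, page_h_pt, max_w_px, max_h_px):
    """Largest dpi at which the page still fits in max_w_px x max_h_px."""
    if page_w_pt <= 0 or page_h_pt <= 0:
        raise ValueError("page dimensions must be positive")
    w_in = page_w_pt / 72.0  # page width in inches
    h_in = page_h_pt / 72.0  # page height in inches
    # The tighter of the two constraints decides the resolution.
    return min(max_w_px / w_in, max_h_px / h_in)


# Example values (assumptions): a US Letter page and a 128x128 box.
dpi = box_fit_dpi(612, 792, 128, 128)
print("convert -density %g input.pdf output.png" % dpi)
```

Here the height is the limiting side, so the rendered page comes out 128 pixels tall and narrower than 128 pixels wide.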
And, as requested, a slightly fudged git commit diff:
commit 0000000000000000000000000000000000000000
Author: Dan D.
Date: Thu Jul 28 16:33:33 2011 -0400
add options to pdfdraw to limit the output's width and height
note that scaling must occur before rotation
diff --git a/apps/pdfdraw.c b/apps/pdfdraw.c
index 0000000..1234567 100644
--- a/apps/pdfdraw.c
+++ b/apps/pdfdraw.c
@@ -12,8 +12,10 @@
#endif
char *output = NULL;
-float resolution = 72;
+float resolution = -1;
float rotation = 0;
+float width = -1;
+float height = -1;
int showxml = 0;
int showtext = 0;
@@ -47,6 +49,8 @@ static void usage(void)
"\t\tsupported formats: pgm, ppm, pam, png, pbm\n"
"\t-p -\tpassword\n"
"\t-r -\tresolution in dpi (default: 72)\n"
+ "\t-w -\tmaximum width (default: no limit)\n"
+ "\t-h -\tmaximum height (default: no limit)\n"
"\t-A\tdisable accelerated functions\n"
"\t-a\tsave alpha channel (only pam and png)\n"
"\t-b -\tnumber of bits of antialiasing (0 to 8)\n"
@@ -150,13 +154,39 @@ static void drawpage(pdf_xref *xref, int pagenum)
if (output || showmd5 || showtime)
{
- float zoom;
+ float zoom = 1.0;
fz_matrix ctm;
fz_bbox bbox;
fz_pixmap *pix;
+ float W, H;
- zoom = resolution / 72;
- ctm = fz_translate(0, -page->mediabox.y1);
+ ctm = fz_identity;
+ ctm = fz_concat(ctm, fz_translate(0, -page->mediabox.y1));
+ ctm = fz_concat(ctm, fz_rotate(page->rotate));
+ ctm = fz_concat(ctm, fz_rotate(rotation));
+ bbox = fz_round_rect(fz_transform_rect(ctm, page->mediabox));
+
+ W = bbox.x1 - bbox.x0;
+ H = bbox.y1 - bbox.y0;
+ if (resolution != -1)
+ zoom = resolution / 72;
+ if (width != -1)
+ {
+ if (resolution != -1)
+ zoom = MIN(zoom, width/W);
+ else
+ zoom = width/W;
+ }
+ if (height != -1)
+ {
+ if (resolution != -1 || width != -1)
+ zoom = MIN(zoom, height/H);
+ else
+ zoom = height/H;
+ }
+
+ ctm = fz_identity;
+ ctm = fz_concat(ctm, fz_translate(0, -page->mediabox.y1));
ctm = fz_concat(ctm, fz_scale(zoom, -zoom));
ctm = fz_concat(ctm, fz_rotate(page->rotate));
ctm = fz_concat(ctm, fz_rotate(rotation));
@@ -295,7 +325,7 @@ int main(int argc, char **argv)
fz_error error;
int c;
- while ((c = fz_getopt(argc, argv, "o:p:r:R:Aab:dgmtx5")) != -1)
+ while ((c = fz_getopt(argc, argv, "o:p:r:R:w:h:Aab:dgmtx5")) != -1)
{
switch (c)
{
@@ -303,6 +333,8 @@ int main(int argc, char **argv)
case 'p': password = fz_optarg; break;
case 'r': resolution = atof(fz_optarg); break;
case 'R': rotation = atof(fz_optarg); break;
+ case 'w': width = atof(fz_optarg); break;
+ case 'h': height = atof(fz_optarg); break;
case 'A': accelerate = 0; break;
case 'a': savealpha = 1; break;
case 'b': alphabits = atoi(fz_optarg); break;
@@ -321,6 +353,10 @@ int main(int argc, char **argv)
if (fz_optind == argc)
usage();
+ if (width+height == -2)
+ if (resolution == -1)
+ resolution = 72;
+
if (!showtext && !showxml && !showtime && !showmd5 && !output)
{
printf("nothing to do\n");
Best Answer
I think you were on the right track with Acrobat Pro.
You need to change the actual image resolution though. You could do this by manually creating a low-res version (e.g. 50% W x 50% H) and replacing the existing image in the PDF with your new one. In this case, the image dpi inside the PDF would need to be halved too in order to preserve the displayed size. If you kept the same dpi, the image would appear at half the width and half the height, i.e. a quarter of the area.
As long as the document dimensions don't change the OCR text should be mapped to the same spatial coordinates.
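The arithmetic behind this can be checked with a short sketch. The function name displayed_size_in and the pixel and dpi numbers are made up for illustration; the only relationship used is displayed size in inches = pixels / dpi.

```python
def displayed_size_in(width_px, height_px, dpi):
    """Displayed size in inches of an image placed at the given dpi."""
    return (width_px / float(dpi), height_px / float(dpi))


# Illustrative numbers (assumptions): a 2000x1000 px image placed at 300 dpi.
original = displayed_size_in(2000, 1000, 300)
# Halving both the pixel dimensions and the dpi preserves the displayed size.
halved_both = displayed_size_in(1000, 500, 150)
# Halving only the pixel dimensions shrinks each side by half.
halved_pixels_only = displayed_size_in(1000, 500, 300)

area = lambda size: size[0] * size[1]
print(original, halved_both, halved_pixels_only)
print(area(halved_pixels_only) / area(original))
```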
Edit: batch processing using Acrobat Pro
Acrobat Pro (XI) can display the image properties, as shown below.
The numbered values there are related by a simple formula: 2 = 3 / 1.
You can also run a sort of 'smart filter' on your PDF using Acrobat Pro; one of the available preset filters reduces image dpi. So you can just run this preflight option or create your own, adjusting the downscaling options and image compression methods.
I think you can batch process multiple PDF files using this method in combination with the "Action Wizard" tool.