PDF – Reduce resolution, size, dpi, number of pixels in PDF images

dpi pdf resolution

I scanned a text at 600dpi and it turned out to be much more than I needed to make a PDF out of it. I've already OCR'ed the text and I want to retain the OCR in the PDF.

I want to decrease the number of pixels (dpi? Sorry, I'm not sure what I'd call them, I'm not used to image processing), so I can make the PDF size smaller. The images are too big when I open the PDF; it would be fine to shrink them by decreasing the number of pixels (as it is now, I can zoom into the images much more than I need).

How can I shrink image size by reducing the number of pixels (dpi)?

I don't want to re-print the PDF, or rescan it, because I don't want to lose the OCR. I tried using Adobe Acrobat Pro DC "Save as Optimized PDF", and shrank all images above 50dpi to 50dpi. It made the PDF bigger! (I think the PDF is already compressed; but I don't want compression, I want to reduce the number of pixels/the resolution)

I'm on Windows 7, 64 bit

Best Answer

I think you were on the right track with Acrobat Pro.

You need to change the actual image resolution though. You could do this by manually creating a low-res version (e.g. 50% W x 50% H) and replacing the existing image in the PDF with your new one. In this case, the image dpi inside the PDF would need to be halved too in order to preserve the size. If you kept the same dpi, the image would appear at half the width and half the height, i.e. a quarter of the area.

As long as the document dimensions don't change the OCR text should be mapped to the same spatial coordinates.

Edit: batch processing using Acrobat Pro

Acrobat Pro (XI) can display the properties of an image in a PDF. Three quantities are involved:

1. Once an image is part of a PDF it gets a physical "size" on the 'virtual paper'.
2. The ppi (or dpi, but that is more for printing contexts) is a PDF metric that gives the ratio between the number of pixels and the physical size. I believe the unit pt/inch that Acrobat shows for it is incorrect; it should be ppi. I also think calling it resolution is a poor word choice.
3. The real image resolution (width × height in pixels) is a PDF-independent image property; it affects how large the file is and how much you can meaningfully zoom in when viewing digitally.

There is a simple mathematical relationship: 2 = 3 / 1, that is, ppi = number of pixels / physical size.

What you want to do is reduce 3 (the pixel count) while keeping 1 (the physical size) constant, thereby implicitly reducing 2 (the ppi) by a corresponding amount.
Most editors use the wording "change dpi", which is effectively the same: change 2 and implicitly adjust 3 such that 1 remains the same size.
But under the hood the largest change occurs to the image resolution (3); the ppi/dpi is just a number that needs to be updated in the PDF, so I find my wording better :)
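
To make the relationship concrete, here is a small Python sketch. The function name `downsample_plan` and the example numbers (an 8.5 x 11 inch page scanned at 600 ppi, downsampled to 150 ppi) are illustrative assumptions, not values taken from the question.

```python
def downsample_plan(width_px, height_px, current_ppi, target_ppi):
    """Return new pixel dimensions that keep the physical size unchanged.

    Uses ppi = pixels / inches, so inches = pixels / ppi stays fixed
    while the pixel count and the ppi shrink by the same factor.
    """
    if current_ppi <= 0 or target_ppi <= 0:
        raise ValueError("ppi values must be positive")
    width_in = width_px / current_ppi
    height_in = height_px / current_ppi
    new_w = round(width_in * target_ppi)
    new_h = round(height_in * target_ppi)
    return new_w, new_h, (width_in, height_in)


# Hypothetical example: an 8.5 x 11 inch page scanned at 600 ppi.
new_w, new_h, size_in = downsample_plan(5100, 6600, 600, 150)
print(new_w, new_h, size_in)  # 1275 1650 (8.5, 11.0)
```

Going from 600 to 150 ppi divides each dimension by 4 and the total pixel count by 16, while the page still measures 8.5 x 11 inches, which is why the OCR text layer should stay aligned.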

You can run a sort of 'smart filter' on your PDF using the Preflight tool in Acrobat Pro; one of the available preset filters reduces image dpi. So you can just run this Preflight option or create your own, adjusting the downscaling options and image compression methods.

I think you can batch process multiple pdf files using this method in combination with the "action wizard" tool.

Related Solutions

Finding Image resolution in PDF file

slhck's answer and scruss' comment deserve to be updated: pdfimages now (at least since version 0.26.5) explicitly lists x-ppi and y-ppi. Here is a sample output:

$ pdfimages -list example.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2244  2244  cmyk    4   8  image  no       215  0   301   301  418K 2.1%
   2     1 image     900   600  rgb     3   8  image  no       324  0  1524  1525 35.5K 2.2%

On Debian (Wheezy) and Fedora (23), pdfimages is part of the poppler-utils package.
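
If you want to check many pages at once, the listing can be parsed with a few lines of Python. This is a minimal sketch that assumes the column layout shown in the sample output above (other poppler versions may print different columns) and that the ppi columns are whole numbers; the helper names `parse_pdfimages_list` and `oversampled` are made up for illustration.

```python
def parse_pdfimages_list(text):
    """Parse `pdfimages -list` output into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Find the header row (skips anything printed before it).
    start = next(i for i, ln in enumerate(lines) if ln.split()[:2] == ["page", "num"])
    header = lines[start].split()  # "object ID" splits into two columns
    rows = []
    for ln in lines[start + 1:]:
        if set(ln.strip()) == {"-"}:
            continue  # separator line
        fields = ln.split()
        if len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


def oversampled(rows, max_ppi):
    """Return (page, x-ppi, y-ppi) for images above max_ppi in either direction."""
    return [
        (int(r["page"]), int(r["x-ppi"]), int(r["y-ppi"]))
        for r in rows
        if max(int(r["x-ppi"]), int(r["y-ppi"])) > max_ppi
    ]


sample = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2244  2244  cmyk    4   8  image  no       215  0   301   301  418K 2.1%
   2     1 image     900   600  rgb     3   8  image  no       324  0  1524  1525 35.5K 2.2%
"""
print(oversampled(parse_pdfimages_list(sample), 600))  # [(2, 1524, 1525)]
```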

PDF => raster, is it possible to adapt the sampling resolution to the input page size

I modified mupdf's pdfdraw to support drawing in best-fit mode, so I could state that the output needed to be 128x128 at most and it would fit the output in that box while maintaining the aspect ratio. Before I did that, the only way was to use pdfinfo to get the page size, do the calculations to fit it in a box, and then ask pdfdraw to draw it with the resulting scale factor (dots per inch).

Well, after that long story the process to do that is rather simple:

1. Get the size of the page to render (in PDF terms, the media box). This can be done via pdfinfo and grep, which report it in pts (points, 1/72 of an inch), or via a PDF library such as pypdf (the maintained successor to pyPdf):
```
from pypdf import PdfReader

reader = PdfReader("/home/dan/Desktop/Sieve-JFP.pdf")
# The media box is [x0, y0, x1, y1] in points; take its width and height
box = reader.pages[0].mediabox
w, h = float(box.width), float(box.height)
```
2. For a box fit, compute dpi = min( A/(w/72.), B/(h/72.) ), where A is the maximum width and B is the maximum height in pixels; w and h are the width and height of the page in points.
3. Pass dpi to convert -density $dpi
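
The box-fit formula can be sketched as a small standalone function. The name `box_fit_dpi` and the example page (US Letter, 612 x 792 pt) are assumptions chosen for illustration; only the formula itself comes from the steps above.

```python
def box_fit_dpi(page_w_pt, page_h_pt, max_w_px, max_h_px):
    """Largest dpi at which the page renders within max_w_px x max_h_px."""
    if page_w_pt <= 0 or page_h_pt <= 0:
        raise ValueError("page dimensions must be positive")
    # 72 points per inch, so the page width in inches is page_w_pt / 72.
    return min(max_w_px / (page_w_pt / 72.0), max_h_px / (page_h_pt / 72.0))


# Hypothetical example: US Letter page (612 x 792 pt) into a 128 x 128 box.
dpi = box_fit_dpi(612, 792, 128, 128)
print(round(dpi, 2))  # 11.64
```

Here the height is the limiting side: at about 11.64 dpi the 11 inch side renders to 128 pixels and the 8.5 inch side to roughly 99 pixels, so the aspect ratio is preserved.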

And, as requested, a slightly fudged git commit diff:

commit 0000000000000000000000000000000000000000
Author: Dan D.
Date:   Thu Jul 28 16:33:33 2011 -0400

    add options to pdfdraw to limit the output's width and height

    note that scaling must occur before rotation

diff --git a/apps/pdfdraw.c b/apps/pdfdraw.c
index 0000000..1234567 100644
--- a/apps/pdfdraw.c
+++ b/apps/pdfdraw.c
@@ -12,8 +12,10 @@
 #endif

 char *output = NULL;
-float resolution = 72;
+float resolution = -1;
 float rotation = 0;
+float width = -1;
+float height = -1;

 int showxml = 0;
 int showtext = 0;
@@ -47,6 +49,8 @@ static void usage(void)
        "\t\tsupported formats: pgm, ppm, pam, png, pbm\n"
        "\t-p -\tpassword\n"
        "\t-r -\tresolution in dpi (default: 72)\n"
+       "\t-w -\tmaximum width (default: no limit)\n"
+       "\t-h -\tmaximum height (default: no limit)\n"
        "\t-A\tdisable accelerated functions\n"
        "\t-a\tsave alpha channel (only pam and png)\n"
        "\t-b -\tnumber of bits of antialiasing (0 to 8)\n"
@@ -150,13 +154,39 @@ static void drawpage(pdf_xref *xref, int pagenum)

    if (output || showmd5 || showtime)
    {
-       float zoom;
+       float zoom = 1.0;
        fz_matrix ctm;
        fz_bbox bbox;
        fz_pixmap *pix;
+       float W, H;

-       zoom = resolution / 72;
-       ctm = fz_translate(0, -page->mediabox.y1);
+       ctm = fz_identity;
+       ctm = fz_concat(ctm, fz_translate(0, -page->mediabox.y1));
+       ctm = fz_concat(ctm, fz_rotate(page->rotate));
+       ctm = fz_concat(ctm, fz_rotate(rotation));
+       bbox = fz_round_rect(fz_transform_rect(ctm, page->mediabox));
+
+       W = bbox.x1 - bbox.x0; 
+       H = bbox.y1 - bbox.y0;
+       if (resolution != -1)
+           zoom = resolution / 72;
+       if (width != -1) 
+       {
+           if (resolution != -1)
+               zoom = MIN(zoom, width/W);
+           else
+               zoom = width/W;
+       }
+       if (height != -1)
+       {
+           if (resolution != -1 || width != -1)
+               zoom = MIN(zoom, height/H);
+           else
+               zoom = height/H;
+       }
+
+       ctm = fz_identity;
+       ctm = fz_concat(ctm, fz_translate(0, -page->mediabox.y1));
        ctm = fz_concat(ctm, fz_scale(zoom, -zoom));
        ctm = fz_concat(ctm, fz_rotate(page->rotate));
        ctm = fz_concat(ctm, fz_rotate(rotation));
@@ -295,7 +325,7 @@ int main(int argc, char **argv)
    fz_error error;
    int c;

-   while ((c = fz_getopt(argc, argv, "o:p:r:R:Aab:dgmtx5")) != -1)
+   while ((c = fz_getopt(argc, argv, "o:p:r:R:w:h:Aab:dgmtx5")) != -1)
    {
        switch (c)
        {
@@ -303,6 +333,8 @@ int main(int argc, char **argv)
        case 'p': password = fz_optarg; break;
        case 'r': resolution = atof(fz_optarg); break;
        case 'R': rotation = atof(fz_optarg); break;
+       case 'w': width = atof(fz_optarg); break;
+       case 'h': height = atof(fz_optarg); break;
        case 'A': accelerate = 0; break;
        case 'a': savealpha = 1; break;
        case 'b': alphabits = atoi(fz_optarg); break;
@@ -321,6 +353,10 @@ int main(int argc, char **argv)
    if (fz_optind == argc)
        usage();

+   if (width+height == -2)
+       if (resolution == -1)
+           resolution = 72;
+
    if (!showtext && !showxml && !showtime && !showmd5 && !output)
    {
        printf("nothing to do\n");
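
The zoom selection in the diff boils down to taking the smallest zoom among the limits that were actually given. The following Python sketch restates that logic under the assumption that `None` stands in for the `-1` "not set" sentinel; the function name `choose_zoom` is hypothetical and not part of mupdf.

```python
def choose_zoom(bbox_w, bbox_h, resolution=None, max_width=None, max_height=None):
    """Pick the smallest zoom among the constraints that were given.

    bbox_w and bbox_h are the size of the rotated page bounding box at
    zoom 1 (72 dpi), matching W and H in the diff.
    """
    candidates = []
    if resolution is not None:
        candidates.append(resolution / 72.0)
    if max_width is not None:
        candidates.append(max_width / bbox_w)
    if max_height is not None:
        candidates.append(max_height / bbox_h)
    # With no options at all, fall back to 72 dpi, i.e. zoom 1.0.
    return min(candidates) if candidates else 1.0


# Hypothetical example: a 612 x 792 pt page limited to 128 x 128 pixels.
print(round(choose_zoom(612, 792, max_width=128, max_height=128), 4))  # 0.1616
```

Because the bounding box is measured after rotation, a page rotated by 90 degrees swaps W and H before the limits are applied, which is why the diff builds a rotation-only matrix first.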
