PDF => raster, is it possible to adapt the sampling resolution to the input page size

ghostscript imagemagick pdf

I am using convert (an ImageMagick component that delegates to Ghostscript in the background) to transform the first page of PDF files into images.

Usually, convert -density 200 file.pdf[0] first_page.png will do the job, and it will sample the PDF file at 200 pixels per inch of paper.

However, once in a while a PDF is abnormally huge: sometimes A0 paper, and recently a PDF with a page exceeding 23 m² (183 inches in length, 185 in width).

For such files, convert will hang and eat CPU time. Images of 35000+ pixels in width and height are simply not usable.

Therefore the question: is there a switch in ImageMagick that would adapt the density to the page size, or at least let us sample no more than a bounded portion of the page (the top-left corner, 30×30 inches, for example)?

Thanks.

EDIT: On its official git repository, MuPDF has added the -w and -h switches that, jointly with -r will do what is wanted here.

Best Answer

I modified mupdf's pdfdraw to support drawing in best-fit mode, so I could state that the output needed to be 128x128 at most and it would fit the output in that box while maintaining the aspect ratio. Before I did that, the only way was to use pdfinfo to get the page size, do the calculations to fit it in a box, and then ask pdfdraw to draw it with that scale factor (dots per inch).

Well, after that long story the process to do that is rather simple:

Get the size of the page to render (in PDF terms, the media box). This can be done via pdfinfo and grep, which report it in pts (points, 1/72 of an inch), or via a PDF library like pyPDF:
```
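import pyPdf
p = pyPdf.PdfFileReader(open("/home/dan/Desktop/Sieve-JFP.pdf", "rb"))
# MediaBox is [x0, y0, x1, y1] in points; subtract the corners to get the size
x0, y0, x1, y1 = [float(v) for v in p.pages[0]['/MediaBox']]
w, h = x1 - x0, y1 - y0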
```
For a box fit, compute dpi = min( A/(w/72.), B/(h/72.) )
where A is the maximum output width and B is the maximum output height, both in pixels; w and h are the width and height of the page in points.
Pass the result to convert as -density $dpi
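
The steps above can be sketched in Python as follows. This is only an illustration under a few assumptions: the helper names parse_page_size and fit_dpi are made up for this example, the sample text imitates the "Page size" line that pdfinfo prints, and the 200 dpi cap and 6000-pixel box are arbitrary choices (200 being the density used in the question).

```python
import re

def parse_page_size(pdfinfo_output):
    """Return (width_pt, height_pt) taken from pdfinfo's 'Page size' line."""
    m = re.search(r"Page size:\s+([\d.]+) x ([\d.]+) pts", pdfinfo_output)
    if m is None:
        raise ValueError("no 'Page size' line found")
    return float(m.group(1)), float(m.group(2))

def fit_dpi(width_pt, height_pt, max_w_px, max_h_px, max_dpi=200.0):
    """Largest density that keeps the page inside max_w_px x max_h_px."""
    # Points are 1/72 inch, so width_pt / 72 is the page width in inches.
    dpi = min(max_w_px / (width_pt / 72.0), max_h_px / (height_pt / 72.0))
    # Never sample finer than the usual density for normal-sized pages.
    return min(dpi, max_dpi)

# Hypothetical pdfinfo output for the 183 x 185 inch page from the question.
sample = "Pages:          1\nPage size:      13176 x 13320 pts\n"
w, h = parse_page_size(sample)
dpi = fit_dpi(w, h, 6000, 6000)

# Argument list for the convert call; nothing is executed here.
command = ["convert", "-density", "%.2f" % dpi, "file.pdf[0]", "first_page.png"]
```

With the cap in place, ordinary pages keep the usual density and only oversized pages are sampled more coarsely.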

and as requested a slightly fudged git commit diff:

commit 0000000000000000000000000000000000000000
Author: Dan D.
Date:   Thu Jul 28 16:33:33 2011 -0400

    add options to pdfdraw to limit the output's width and height

    note that scaling must occur before rotation

diff --git a/apps/pdfdraw.c b/apps/pdfdraw.c
index 0000000..1234567 100644
--- a/apps/pdfdraw.c
+++ b/apps/pdfdraw.c
@@ -12,8 +12,10 @@
 #endif

 char *output = NULL;
-float resolution = 72;
+float resolution = -1;
 float rotation = 0;
+float width = -1;
+float height = -1;

 int showxml = 0;
 int showtext = 0;
@@ -47,6 +49,8 @@ static void usage(void)
        "\t\tsupported formats: pgm, ppm, pam, png, pbm\n"
        "\t-p -\tpassword\n"
        "\t-r -\tresolution in dpi (default: 72)\n"
+       "\t-w -\tmaximum width (default: no limit)\n"
+       "\t-h -\tmaximum height (default: no limit)\n"
        "\t-A\tdisable accelerated functions\n"
        "\t-a\tsave alpha channel (only pam and png)\n"
        "\t-b -\tnumber of bits of antialiasing (0 to 8)\n"
@@ -150,13 +154,39 @@ static void drawpage(pdf_xref *xref, int pagenum)

    if (output || showmd5 || showtime)
    {
-       float zoom;
+       float zoom = 1.0;
        fz_matrix ctm;
        fz_bbox bbox;
        fz_pixmap *pix;
+       float W, H;

-       zoom = resolution / 72;
-       ctm = fz_translate(0, -page->mediabox.y1);
+       ctm = fz_identity;
+       ctm = fz_concat(ctm, fz_translate(0, -page->mediabox.y1));
+       ctm = fz_concat(ctm, fz_rotate(page->rotate));
+       ctm = fz_concat(ctm, fz_rotate(rotation));
+       bbox = fz_round_rect(fz_transform_rect(ctm, page->mediabox));
+
+       W = bbox.x1 - bbox.x0; 
+       H = bbox.y1 - bbox.y0;
+       if (resolution != -1)
+           zoom = resolution / 72;
+       if (width != -1) 
+       {
+           if (resolution != -1)
+               zoom = MIN(zoom, width/W);
+           else
+               zoom = width/W;
+       }
+       if (height != -1)
+       {
+           if (resolution != -1 || width != -1)
+               zoom = MIN(zoom, height/H);
+           else
+               zoom = height/H;
+       }
+
+       ctm = fz_identity;
+       ctm = fz_concat(ctm, fz_translate(0, -page->mediabox.y1));
        ctm = fz_concat(ctm, fz_scale(zoom, -zoom));
        ctm = fz_concat(ctm, fz_rotate(page->rotate));
        ctm = fz_concat(ctm, fz_rotate(rotation));
@@ -295,7 +325,7 @@ int main(int argc, char **argv)
    fz_error error;
    int c;

-   while ((c = fz_getopt(argc, argv, "o:p:r:R:Aab:dgmtx5")) != -1)
+   while ((c = fz_getopt(argc, argv, "o:p:r:R:w:h:Aab:dgmtx5")) != -1)
    {
        switch (c)
        {
@@ -303,6 +333,8 @@ int main(int argc, char **argv)
        case 'p': password = fz_optarg; break;
        case 'r': resolution = atof(fz_optarg); break;
        case 'R': rotation = atof(fz_optarg); break;
+       case 'w': width = atof(fz_optarg); break;
+       case 'h': height = atof(fz_optarg); break;
        case 'A': accelerate = 0; break;
        case 'a': savealpha = 1; break;
        case 'b': alphabits = atoi(fz_optarg); break;
@@ -321,6 +353,10 @@ int main(int argc, char **argv)
    if (fz_optind == argc)
        usage();

+   if (width+height == -2)
+       if (resolution == -1)
+           resolution = 72;
+
    if (!showtext && !showxml && !showtime && !showmd5 && !output)
    {
        printf("nothing to do\n");
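
For readers who do not want to trace the C, the zoom selection in the patch can be restated in a few lines of Python. This is a paraphrase, not MuPDF code: choose_zoom is a hypothetical name, None stands in for the patch's -1 "not set" marker, and bbox_w and bbox_h correspond to W and H, the size of the rotated page box at 72 dpi.

```python
def choose_zoom(bbox_w, bbox_h, resolution=None, max_w=None, max_h=None):
    """Pick the smallest zoom among the limits that were actually given."""
    candidates = []
    if resolution is not None:
        candidates.append(resolution / 72.0)   # -r: dots per inch
    if max_w is not None:
        candidates.append(max_w / bbox_w)      # -w: maximum width in pixels
    if max_h is not None:
        candidates.append(max_h / bbox_h)      # -h: maximum height in pixels
    if not candidates:
        return 1.0                             # no option given: 72 dpi default
    return min(candidates)

# A letter page (612 x 792 pt) limited to a 128 x 128 pixel box.
zoom = choose_zoom(612.0, 792.0, max_w=128, max_h=128)
```

Taking the minimum means that -r acts as an upper bound on the resolution whenever -w or -h is also given.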

Software Requirements

The following software packages are available for both Windows and Linux systems, and are required for a complete, working solution:

gvim - Used to export syntax highlighted source code to HTML.
moria - Colour scheme for syntax highlighting.
wkhtmltoimage - Used to convert HTML documents to PNG files.
gawk and sed - Text processing tools.
ImageMagick - Used to trim the PNG and add a border.

General Steps

Here is how the solution works:

Load the source code into an editor that can add splashes of colour.
Export the source code as an HTML document (with embedded FONT tags).
Strip the background attribute from the HTML document (to allow transparency).
Convert the HTML document to a PNG file.
Trim the PNG border.
Add a small, 25 pixel border around the image.
Delete temporary files.

The script generates images that are all the same width for source files whose lines are all 80 characters or shorter. Source files with lines over 80 characters long result in images as wide as necessary to retain the entire line.
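
The width rule can be sketched in Python as below. The batch file itself does this with gawk; extent_args is a hypothetical helper written only for illustration, and the 3950-pixel width is the value the batch file uses for 80 columns.

```python
def extent_args(lines, max_cols=80, full_width_px=3950):
    """Return the ImageMagick arguments that pad the image to a fixed width.

    An empty list means "do not pad": at least one line is longer than
    max_cols, so the image keeps whatever width the text needs.
    """
    longest = max((len(line.rstrip("\r\n")) for line in lines), default=0)
    if longest > max_cols:
        return []
    return ["-extent", "%dx" % full_width_px]

short_listing = ["int main(void)\n", "{\n", "    return 0;\n", "}\n"]
args = extent_args(short_listing)
```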

Installation

Install the components into the following locations:

gvim - C:\Program Files\Vim
moria - C:\Program Files\Vim\vim73\colors
wkhtmltoimage - C:\Program Files\wkhtml
ImageMagick - C:\Program Files\ImageMagick
Gawk and Sed - C:\Program Files\GnuWin32

Note: ImageMagick has a program called convert.exe, which cannot supersede the Windows convert command. Because of this, the full path to convert.exe must be hard-coded in the batch file (as opposed to adding ImageMagick to the PATH).

Environment Variables

Set the PATH environment variable to:

"C:\Program Files\Vim\vim73";"C:\Program Files\wkhtml";"C:\Program Files\GnuWin32\bin"

Batch File

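Once the batch file below is in place, run it with the source file as the first argument (any second argument turns on line numbers):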

src2png.bat src2png.bat

Create a batch file called src2png.bat by copying the following contents:

@ECHO OFF

SET NUMBERS=-c "set number"
IF "%2" == "" SET NUMBERS=

ECHO Converting %1 to %1.html...
gvim -e %1 -c "set nobackup" %NUMBERS% -c ":colorscheme moria" ^
  -c :TOhtml -c wq -c :q

REM Remove all background-color occurrences (without being self-referential)
sed -i "s/background-color: #......; \(.*\)}$/\1 }/g" %1.html

ECHO Converting %1.html to %1.png...
wkhtmltoimage --format png --transparent --minimum-font-size 80 ^
  --quality 100 --width 3600 ^
  %1.html %1.png

move %1.png %1.orig.png

REM If the text file has lines that exceed 80 characters, don't crop the
REM resulting image. (The book automatically shrinks large images to fit.)
REM The 3950 is the 80 point font at 80 characters with padding for line
REM numbers.
SET LENGTH=0
FOR /F %%l IN ('gawk ^
  "BEGIN {x=0} {if( length($0)>x ) x=length()} END {print x;}" %1') ^
DO (
  SET LENGTH=%%l
)
SET EXTENT=-extent 3950x
IF %LENGTH% GTR 80 SET EXTENT=

REM Trim the image height, then extend the width for 80 columns, if needed.
REM The result is that all images will be resized the same amount, thus
REM making the font size the same maximum for all source listings. Source
REM files beyond the 80 character limit will be scaled as necessary.
ECHO Trimming %1.png...
"C:\Program Files\ImageMagick\convert.exe" -format png %1.orig.png ^
  -density 150x150 ^
  -background none -antialias -trim +repage ^
  %EXTENT% ^
  -bordercolor none -border 25 ^
  %1.png

ECHO Removing old files...
IF EXIST %1.orig.png DEL /q %1.orig.png
IF EXIST %1.html DEL /q %1.html
IF EXIST sed*. DEL /q sed*.

Improvements and optimizations welcome.

Note: The latest version of wkhtmltoimage properly handles overriding the background colour. Thus the line to remove the CSS for background colours is no longer necessary, in theory.

How to make all pages in a PDF the same size

The output of your pdfinfo command clearly shows that all your pages have the same size: 612 x 792 pt ("MediaBox"), which is the 'letter' size.

However, page 1 is set to tell the PDF viewer (and the printer) to display only a part of the page area ("CropBox").

If you remove the CropBox (as well as the ArtBox, BleedBox and TrimBox) settings in your PDF with a tool of your choice (even a text editor), or if you change them to the same values as the MediaBox, PDF viewers (and printers) will show the full page content (even if the "rest" is only white margins). Note: there may be hidden content on your first page that becomes visible once you remove the CropBox, TrimBox, BleedBox and ArtBox.
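
As a rough sketch of both options, the Python below works on a plain dict standing in for a page dictionary. It makes several assumptions: show_full_page is a hypothetical helper, a real PDF library uses its own object types, and boxes inherited from a parent node of the page tree would need separate handling.

```python
BOXES_TO_RESET = ("/CropBox", "/ArtBox", "/BleedBox", "/TrimBox")

def show_full_page(page, drop=False):
    """Set every optional box to the MediaBox, or delete it when drop=True."""
    media = list(page["/MediaBox"])
    for key in BOXES_TO_RESET:
        if key in page:
            if drop:
                del page[key]            # viewers then fall back to the MediaBox
            else:
                page[key] = list(media)  # same values as the MediaBox
    return page

# Stand-in for page 1: letter-sized MediaBox with a smaller CropBox.
page = {"/MediaBox": [0, 0, 612, 792], "/CropBox": [50, 50, 400, 500]}
show_full_page(page)
```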