Ubuntu – How to programmatically determine DPI of images in PDF file

command-line display-resolution pdf

I have some PDF files that I want to split into TIFF files using convert (in order to OCR them with tesseract). This is working well so far, except that to automate the whole process I need to set the DPI of the convert output. Right now, I am using a command like this:

convert -density 300 myFile.pdf -depth 8 -background white output-%04d.tiff

… which renders the PDF pages at 300 DPI. However, some PDF files contain lower-resolution images (e.g. 150 DPI), so I don't want to render them at 300 DPI with convert: that creates excessively large TIFF files without any additional information.

I know that there are ways to check the DPI of images in a PDF file by opening Adobe Acrobat and messing around in the "preflight" tools. However, is there a way to determine the DPI of the images in a particular PDF file from the command line?

Best Answer

Main answer

I am interested in the same kind of job (though not to OCR the PDF files directly, but to convert them to DjVu and then OCR them). I found this question, but the existing responses were lacking: I had to guess the DPI of the images from their pixel dimensions and the page size reported by pdfinfo, or resort to other tricks, and the images inside a PDF may have different densities anyway.

After a lot more research, I found that you can use pdfimages (from the poppler-utils package) as follows:

$ pdfimages -list deptest.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   1  image  no         9  0    53    53  169B  14%
   2     1 image     100   100  gray    1   1  ccitt  no   [inline]      53    53  698B  56%

Notice the x-ppi and y-ppi columns in the listing above. The enc column also shows the format in which each image is stored in the PDF, which is cool (sometimes it is JBIG2, sometimes JPEG2000, etc.).

Note: The file deptest.pdf used above is available from pdfsizeopt's repository.
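
To feed this back into the convert command from the question, the x-ppi column can be pulled out with awk. This is only a minimal sketch, assuming the column layout shown above. The name max_ppi is a hypothetical helper, not part of poppler-utils. It counts columns from the right because the "object ID" column holds two fields for regular images but only one for [inline] images.

```shell
# max_ppi: read `pdfimages -list` output on stdin and print the highest
# x-ppi value found (0 if the listing contains no images).
# Assumption: x-ppi is always the 4th column from the right
# (x-ppi, y-ppi, size, ratio), as in the listing above.
max_ppi() {
    awk 'NR > 2 && NF >= 4 {          # skip the two header lines
             ppi = $(NF - 3) + 0      # x-ppi, forced to a number
             if (ppi > max) max = ppi
         }
         END { print max + 0 }'
}

# Usage (myFile.pdf as in the question):
#   dpi=$(pdfimages -list myFile.pdf | max_ppi)
#   convert -density "$dpi" myFile.pdf -depth 8 -background white output-%04d.tiff
```

Taking the maximum is just one possible policy. Small images such as masks or logos can report a very different ppi than the scanned page, and a PDF without any images yields 0, so it is worth checking the value before passing it to -density.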

The real action

After that, you can simply extract the images with pdfimages itself, or use pdftoppm (also from poppler-utils) to render entire pages in one of several formats (e.g., TIFF, for OCR with tesseract).

You can use something like the following (assuming you have created a directory named imgs where you will put your images):

pdfimages -png Faraway-PRA.pdf imgs/prefix

The files will be created inside the directory imgs with names starting with prefix, as in:

$ ls imgs
prefix-000.png  prefix-047.png  prefix-094.png  prefix-141.png
prefix-001.png  prefix-048.png  prefix-095.png  prefix-142.png
prefix-002.png  prefix-049.png  prefix-096.png  prefix-143.png
prefix-003.png  prefix-050.png  prefix-097.png  prefix-144.png
(...)
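
The pdfimages call above only extracts the embedded images. For the other route mentioned earlier, rendering whole pages with pdftoppm, something along these lines should work. It is a rough sketch: render_pages is a hypothetical wrapper name, and the DPI passed to -r would typically be the x-ppi value reported by pdfimages -list.

```shell
# render_pages PDF DPI PREFIX
# Render every page of PDF at DPI pixels per inch, writing one TIFF file
# per page with names starting with PREFIX (the directory must exist).
render_pages() {
    pdftoppm -r "$2" -tiff "$1" "$3"
}

# Usage, with the file and directory from the example above:
#   render_pages Faraway-PRA.pdf 150 imgs/page
```

Unlike pdfimages, this rasterizes text and vector graphics as well, so the output matches what a viewer would show for each page.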

You can then post-process the images as you see fit with tools like scantailor.

More direct answer

If you just want to OCR a PDF file, you can use a program that is well-maintained and already packaged, namely ocrmypdf.
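
In its simplest form ocrmypdf takes an input file and an output file. A small sketch of how that might be wrapped in a script (ocr_pdf is a hypothetical function name, and the file names in the usage comment are placeholders):

```shell
# ocr_pdf INPUT OUTPUT
# Run ocrmypdf on INPUT and write the searchable PDF to OUTPUT,
# failing early with a clear message if ocrmypdf is not installed.
ocr_pdf() {
    if ! command -v ocrmypdf >/dev/null 2>&1; then
        echo "ocrmypdf is not installed" >&2
        return 127
    fi
    ocrmypdf "$1" "$2"
}

# Usage:
#   ocr_pdf scanned.pdf searchable.pdf
```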