Ubuntu – How to programmatically determine DPI of images in PDF file

command line, display-resolution, pdf

I have some PDF files that I want to split into TIFF files using convert (in order to OCR them with tesseract). So far this works great, except that to automate the whole process I need to set the DPI of the convert output. Right now, I am using a command like this:

convert -density 300 myFile.pdf -depth 8 -background white output-%04d.tiff

… which outputs the PDF files at 300 DPI. However, some PDF files have lower DPI (e.g. 150 DPI) which means that I don't want to output them at 300 DPI via convert – this creates excessively large TIFF files without any additional information.

I know that there are ways to check the DPI of images in a PDF file by opening Adobe Acrobat and messing around in the "preflight" tools. However, is there a way to determine via the command line the DPI of a particular PDF file?

Best Answer

Main answer

I am interested in the same kind of job (though I convert the PDF files to DjVu and then OCR them, rather than OCR the PDFs directly), so I found this question, but the existing responses were lacking: I had to guess the DPI of the images from their pixel counts and the page size reported by pdfinfo, or resort to other tricks. On top of that, the images inside a single PDF may have different densities.
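The guesswork mentioned above boils down to simple arithmetic: a PDF point is 1/72 inch, so an image that is px pixels wide and fills a page pt points wide has roughly px * 72 / pt DPI. A minimal sketch, assuming the image spans the full page width (typical for scans, but not guaranteed in general); dpi_from_size is a hypothetical helper name, not part of any tool:

```shell
# Estimate DPI from an image width in pixels and a page width in PDF points
# (1 point = 1/72 inch), e.g. the "Page size" values reported by pdfinfo.
# Assumption: the image covers the whole page width.
dpi_from_size() {
    awk -v px="$1" -v pt="$2" 'BEGIN { printf "%.0f\n", px * 72 / pt }'
}

# Example: a 2550-pixel-wide scan on a US Letter page (612 points wide)
#   dpi_from_size 2550 612
```

This estimate breaks down as soon as an image covers only part of the page, which is why reading the density directly is more reliable.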

After a lot more research, I found that you can use pdfimages (from the poppler-utils package) like this:

$ pdfimages -list deptest.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   1  image  no         9  0    53    53  169B  14%
   2     1 image     100   100  gray    1   1  ccitt  no   [inline]      53    53  698B  56%

Notice the x-ppi and y-ppi columns in the listing above. It also shows the format in which each image is stored in the PDF, which is cool (sometimes it is JBIG2, sometimes JPEG2000, etc.).

Note: The file deptest.pdf used above is available from pdfsizeopt's repository.
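To feed this into the convert command from the question, the listing can be reduced to a single number. The following is only a sketch: max_ppi is a hypothetical helper name, and it assumes the column layout shown above, where x-ppi is the fourth field from the end of each row. Counting from the right matters because the "object ID" column holds two fields for regular images but a single [inline] field for inline ones.

```shell
# Print the highest x-ppi found in `pdfimages -list` output read from stdin.
# Prints 0 when the listing contains no images (e.g. a vector-only PDF).
max_ppi() {
    awk 'NR > 2 && NF >= 4 {
             ppi = $(NF - 3) + 0      # x-ppi is the 4th field from the end
             if (ppi > max) max = ppi
         }
         END { print max + 0 }'
}

# Hypothetical usage (myFile.pdf is a placeholder name):
#   dpi=$(pdfimages -list myFile.pdf | max_ppi)
#   convert -density "$dpi" myFile.pdf -depth 8 -background white output-%04d.tiff
```

Taking the maximum means no image is downsampled. Since a result of 0 means no images were found, a real script would want a fallback density for that case.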

The real action

After that, you can simply extract the images with pdfimages itself or use pdftoppm (also from poppler-utils) to render entire pages in many formats that you may like (e.g., tiff, for scanning with tesseract).

You can use something like the following (assuming you have created a directory named imgs where you will put your images):

pdfimages -png Faraway-PRA.pdf imgs/prefix

The files will be created inside the directory imgs with names starting with prefix, as in:

$ ls imgs
prefix-000.png  prefix-047.png  prefix-094.png  prefix-141.png
prefix-001.png  prefix-048.png  prefix-095.png  prefix-142.png
prefix-002.png  prefix-049.png  prefix-096.png  prefix-143.png
prefix-003.png  prefix-050.png  prefix-097.png  prefix-144.png
(...)

You can then post-process the images however you see fit, with tools like scantailor or anything else you like.
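Because the images in one PDF may have different densities, it can make sense to render each page at its own resolution with pdftoppm. A rough sketch under the same column-layout assumption as the listing above; ppi_by_page is a hypothetical helper name, and the file and directory names in the usage comment are placeholders:

```shell
# Read `pdfimages -list` output on stdin and print one "page max-x-ppi"
# line per page that contains at least one image, sorted by page number.
ppi_by_page() {
    awk 'NR > 2 && NF >= 4 {
             page = $1
             ppi = $(NF - 3) + 0      # x-ppi is the 4th field from the end
             if (!(page in max) || ppi > max[page]) max[page] = ppi
         }
         END { for (p in max) print p, max[p] }' | sort -n
}

# Hypothetical usage: render every page that has images at its own DPI.
#   pdfimages -list myFile.pdf | ppi_by_page |
#   while read -r page dpi; do
#       pdftoppm -f "$page" -l "$page" -r "$dpi" -tiff myFile.pdf "imgs/page-$page"
#   done
```

Pages without any embedded image do not appear in the output, so they would need a default resolution if they should be rendered as well.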

More direct answer

If you just want to OCR a PDF file, you can use a program that is well-maintained and already packaged, namely ocrmypdf.

Related Solutions

Ubuntu – How to remove images from a PDF file

cpdf -draft original.pdf -o version_without_images.pdf

It is not in the repositories but you can find a download (pre-compiled or source) on their website.

Manual:

15.1 Draft Documents

The -draft option removes bitmap (photographic) images from a file, so that it can be printed with less ink. Optionally, the -boxes option can be added, filling the spaces left blank with a crossed box denoting where the image was. This is not guaranteed to be fully visible in all cases (the bitmap may have been partially covered by vector objects or clipped in the original). For example:
 cpdf -draft -boxes in.pdf -o out.pdf

Ubuntu – Why won’t .pdf files containing transparent elements print correctly

Try printing the file from Okular, and report back whether it helped.
You can install it with: sudo apt install okular