I have some PDF files that I want to split apart into TIFF files using convert (in order to OCR via tesseract). This so far is working great, except that in order to automate the whole process, I need to set the DPI of the convert output. Right now, I am using a command like this:
convert -density 300 myFile.pdf -depth 8 -background white output-%04d.tiff
… which outputs the PDF files at 300 DPI. However, some PDF files have a lower DPI (e.g. 150 DPI), which means that I don't want to output them at 300 DPI via convert: this creates excessively large TIFF files without any additional information.
I know that there are ways to check the DPI of images in a PDF file by opening Adobe Acrobat and messing around in the "preflight" tools. However, is there a way to determine via the command line the DPI of a particular PDF file?
Best Answer
Main answer
Since I am interested in the same kind of job (though not necessarily to OCR the PDF files, but to convert them to DjVu and then OCR them), I found this question and found the responses lacking: I needed to guess the DPI of the images from the number of pixels and then use the page size as output by pdfinfo or other tricks, not to mention that the images inside a PDF may have different densities.

After a lot more research, I found that you can use pdfimages (from the package poppler-utils) with its -list option to list the images embedded in a PDF. Notice the x-ppi and y-ppi columns in that listing. It also shows the format in which the images are stored in the PDF, which is cool (sometimes it is JBIG2, sometimes JPEG2000, etc.).

Note: The file deptest.pdf that I used for this is available from pdfsizeopt's repository.

The real action
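As a minimal sketch of the inspection step, the listing can be reduced to a single number that a script can use. The helper name max_ppi is made up for illustration, and the column is located by its x-ppi header rather than by position, on the assumption that the exact layout may differ between poppler versions.

```shell
# max_ppi: read the output of `pdfimages -list` on stdin and print the
# largest x-ppi value found (hypothetical helper, not part of poppler).
max_ppi() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "x-ppi") col = i; next }
    NR == 2 { next }                            # separator line of dashes
    col && $col + 0 > max { max = $col + 0 }    # keep the highest density
    END { if (max > 0) print max; else exit 1 } # fail if nothing was found
  '
}

# Usage (requires poppler-utils):
#   pdfimages -list myFile.pdf | max_ppi
```

Taking the maximum is only one possible policy: since the images inside a PDF may have different densities, rendering at the highest one avoids losing detail, at the cost of larger files for the lower-density pages.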
After that, you can simply extract the images with pdfimages itself or use pdftoppm (also from poppler-utils) to render entire pages in any of several formats (e.g., TIFF, for scanning with tesseract).

You can give pdftoppm an output prefix such as imgs/prefix (assuming you have created a directory named imgs where you will put your images). The files will be created inside the directory imgs with names starting with prefix.

You can then perform any surgery that you see fit with tools like scantailor or whatever you like.

More direct answer
If you just want to OCR a PDF file, you can use a program that is well-maintained and already packaged, namely ocrmypdf.
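For comparison, both routes can be sketched as shell commands. The function name render_tiffs, the default directory imgs, and the file names are illustrative assumptions; the DPI is passed in explicitly, for example the value read from the x-ppi column.

```shell
# render_tiffs PDF DPI [OUTDIR]
# Render every page of PDF to TIFF at DPI using pdftoppm (poppler-utils).
# Hypothetical wrapper; output files start with OUTDIR/prefix.
render_tiffs() {
  pdf=$1
  dpi=$2
  outdir=${3:-imgs}
  mkdir -p "$outdir" || return 1
  pdftoppm -tiff -r "$dpi" "$pdf" "$outdir/prefix"
}

# Manual route: render at the density found with pdfimages, then OCR the pages.
#   render_tiffs myFile.pdf 150
#
# Direct route: let ocrmypdf handle rendering and OCR in one step.
#   ocrmypdf myFile.pdf myFile-ocr.pdf
```

The manual route keeps control over the output density and leaves TIFF files for tools such as tesseract or scantailor, while the direct route is shorter when a searchable PDF is all that is needed.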