Command Line – How to OCR a PDF and Extract Text

command line, ocr, pdf

First, apologies if this has been asked before. I searched the existing posts for a while but could not find an answer.

I am looking for a way on Fedora to OCR a multipage, non-searchable PDF and turn it into a new PDF that carries a text layer on top of the page images. On Mac OS X or Windows we could use Adobe Acrobat, but is there a solution on Linux, specifically on Fedora?

This seems to describe a solution – but unfortunately I am already lost when retrieving exact-image.

Best Answer

The best and easiest way I know of is pypdfocr, a Python module, because it does not alter the original PDF.

pypdfocr your_document.pdf

When it finishes you will have a second file, your_document_ocr.pdf, with searchable text, which is what you asked for. The tool does not change the quality of the image; it only increases the file size a bit by adding the overlay text.

The command is simple enough that no GUI is needed. Installing pypdfocr takes a little more typing:

sudo dnf -y install tesseract 
pip install pypdfocr 

Update, 3 November 2018:

pypdfocr has not been maintained since 2016, and I have noticed some problems with it as a result. ocrmypdf (also a Python module) does a similar job and can be used like this:

ocrmypdf in.pdf out.pdf

To install:

pip install ocrmypdf

or

sudo apt install ocrmypdf      # Ubuntu
sudo dnf -y install ocrmypdf   # Fedora
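
If you have several scanned files, the ocrmypdf call can be wrapped in a small shell loop. The following is only a sketch, not something from the ocrmypdf documentation: the function names ocr_out_name, ocr_one and ocr_dir are made up for illustration, and the NAME_ocr.pdf naming is my own choice, borrowed from what pypdfocr produced. It uses ocrmypdf's --sidecar option so that the recognized text is also written to a plain .txt file next to each output PDF.

```shell
#!/usr/bin/env bash
# Sketch: OCR every PDF in a directory with ocrmypdf.
# Function names and the *_ocr.pdf naming scheme are illustrative assumptions.

# Derive the output path: scan.pdf -> scan_ocr.pdf
ocr_out_name() {
    local in="$1"
    printf '%s_ocr.pdf\n' "${in%.pdf}"
}

# OCR a single file. --sidecar additionally writes the recognized
# text to a plain-text file (scan_ocr.txt) beside the output PDF.
ocr_one() {
    local in="$1" out
    out="$(ocr_out_name "$in")"
    ocrmypdf --sidecar "${out%.pdf}.txt" "$in" "$out"
}

# OCR all PDFs in a directory (default: current directory).
ocr_dir() {
    local dir="${1:-.}" f
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                    # glob matched nothing
        case "$f" in *_ocr.pdf) continue ;; esac   # skip earlier results
        ocr_one "$f" || echo "failed: $f" >&2
    done
}
```

After sourcing the file, `ocr_dir ~/scans` would process every PDF in that directory and leave the originals untouched. Files that fail are reported on stderr and the loop carries on with the next one.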