First, apologies if this has been asked before – I searched through the existing posts for a while, but could not find an answer.
I am interested in a solution for Fedora to OCR a multipage non-searchable PDF and to turn this PDF into a new PDF file that contains the text layer on top of the image. On Mac OSX or Windows we could use Adobe Acrobat, but is there a solution on Linux, specifically on Fedora?
This seems to describe a solution – but unfortunately I am already lost when retrieving exact-image.
Best Answer
The best and easiest way out there is to use pypdfocr, as it doesn't change the PDF. pypdfocr is a Python module. At the end you will have another file, your_document_ocr.pdf, the way you want it, with searchable text. The app doesn't change the quality of the image; it only increases the size of the file a bit by adding the overlay text.
I think the command is simple enough that it doesn't need any GUI. Installing pypdfocr may be a bit more verbose:
Update, 3rd November 2018:
pypdfocr has not been supported since 2016, and I noticed some problems due to it not being maintained. ocrmypdf (module) does a similar job and can be used like this:
To install:
or
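As a rough sketch of the ocrmypdf workflow: the tool is published on PyPI under the name ocrmypdf (and, as far as I know, is also packaged for Fedora), and its command line takes the input PDF and the output PDF as positional arguments, with -l choosing the Tesseract language. The small Python wrapper below is only an illustration. The helper names and the file name your_document.pdf are made up for this example, and it assumes ocrmypdf and Tesseract are already installed.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(input_pdf, output_pdf, language="eng"):
    """Return the argument list for one ocrmypdf run (hypothetical helper)."""
    # ocrmypdf takes the input and output PDFs as positional arguments;
    # -l selects the Tesseract language used for recognition.
    return ["ocrmypdf", "-l", language, str(input_pdf), str(output_pdf)]


def add_text_layer(input_pdf, output_pdf, language="eng"):
    """Run ocrmypdf if it is on PATH; raise if it is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(
        build_ocrmypdf_command(input_pdf, output_pdf, language),
        check=True,  # surface a non-zero exit status as an exception
    )


if __name__ == "__main__":
    source = Path("your_document.pdf")  # placeholder file name
    if source.exists():
        # Write the searchable copy next to the original, e.g. your_document_ocr.pdf
        add_text_layer(source, source.with_name(source.stem + "_ocr.pdf"))
```

The original scan is left untouched; the searchable copy is written to a separate file, which mirrors the your_document_ocr.pdf naming mentioned above.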