Command Line – How to OCR a PDF and Extract Text

Tags: command line, ocr, pdf

First, apologies if this has been asked before. I searched through the existing posts for a while but could not find an answer.

I am interested in a solution for Fedora to OCR a multipage non-searchable PDF and to turn this PDF into a new PDF file that contains the text layer on top of the image. On Mac OSX or Windows we could use Adobe Acrobat, but is there a solution on Linux, specifically on Fedora?

This seems to describe a solution – but unfortunately I am already lost when retrieving exact-image.

Best Answer

The best and easiest way I know of is to use pypdfocr, as it doesn't change the original PDF. pypdfocr is a Python module.

pypdfocr your_document.pdf

When it finishes, you will have a second file, your_document_ocr.pdf, with searchable text, which is what you asked for. The tool doesn't change the quality of the image. It increases the file size a bit because it adds the overlay text.

I think the command is simple enough that it doesn't need a GUI. Installing pypdfocr takes a little more typing:

sudo dnf -y install tesseract 
pip install pypdfocr

Update, 3 November 2018:

pypdfocr has not been supported since 2016, and I have noticed some problems that come from it not being maintained. ocrmypdf (also a Python module) does a similar job and can be used like this:

ocrmypdf in.pdf out.pdf

To install, use either pip or your distribution's package manager:

pip install ocrmypdf

sudo apt install ocrmypdf #ubuntu
sudo dnf -y install ocrmypdf #fedora
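
If you have a whole directory of scanned PDFs, the single command above can be wrapped in a small loop. This is only a sketch: the function names ocr_name and ocr_all and the _ocr.pdf naming scheme are my own choices (borrowed from what pypdfocr produced), not something ocrmypdf requires. The --skip-text option tells ocrmypdf to leave pages that already contain text alone.

```shell
#!/bin/sh
# Derive the output name, e.g. report.pdf -> report_ocr.pdf.
# (The _ocr suffix is an assumed convention, not an ocrmypdf requirement.)
ocr_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# OCR every PDF in the current directory.
ocr_all() {
    for f in ./*.pdf; do
        [ -e "$f" ] || continue                    # glob matched nothing
        case "$f" in *_ocr.pdf) continue ;; esac   # skip our own output files
        # --skip-text: do not OCR pages that already have a text layer
        ocrmypdf --skip-text "$f" "$(ocr_name "$f")"
    done
}
```

Source the script and run ocr_all from the directory that holds the scans. To get the recognized text as plain text afterwards, something like pdftotext (from poppler-utils) on the resulting file should work.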

Related Solutions

PDF – How to View and Edit the Code of a PDF File

You can use sed with binary files (at least GNU sed; some implementations may have trouble with files containing null characters or not ending with a newline character). But the command you used only replaces the first occurrence of /Fit on each line, and lines are pretty much meaningless in a PDF file. You need to replace all occurrences:

sed 's/\/Fit/\/XYZ/g'

(The quotes matter: without them the shell strips the backslashes before sed sees the expression.) It would be more robust to replace /Fit only when it is not followed by a word constituent (e.g. not replacing /Fitness; I don't know if your file contains occurrences of /Fit that would cause trouble). Here's one way:

perl -pe 's!/Fit\b!/XYZ!g'

Linux – How to rasterize all of the text in a PDF

You could test whether image-based PDFs are polluted as well. First convert the PDF to a (multipage) TIFF, e.g. with Ghostscript:

gs -sDEVICE=tiffg4 -o sample.tif sample.pdf

Then convert the TIFF to PDF, e.g.:

tiff2pdf -z -f -F -pA4 -o sample-img.pdf sample.tif

This results in a PDF file where the pages are images instead of text.

Alternatively, if your system supports printing TIFF files, try to print the TIFF directly.

There is also the option of using pdf2ps to convert the PDF to PS, which, if it works, would likely be preferable.
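
The two conversion steps (PDF to TIFF, then TIFF to PDF) can be combined into one helper that cleans up the intermediate TIFF. This is a rough sketch using the same gs and tiff2pdf options as above; the function names img_name and rasterize_pdf and the -img.pdf suffix are assumptions of mine.

```shell
#!/bin/sh
# Default output name, e.g. sample.pdf -> sample-img.pdf (assumed convention).
img_name() {
    printf '%s-img.pdf\n' "${1%.pdf}"
}

# Usage: rasterize_pdf input.pdf [output.pdf]
rasterize_pdf() {
    in=$1
    out=${2:-$(img_name "$in")}
    # Temporary file for the intermediate multipage TIFF
    tmp=$(mktemp "${TMPDIR:-/tmp}/raster.XXXXXX") || return 1
    # Step 1: render the pages to a G4 TIFF; step 2: wrap the TIFF in a PDF
    gs -q -sDEVICE=tiffg4 -o "$tmp" "$in" &&
        tiff2pdf -z -f -F -pA4 -o "$out" "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, rasterize_pdf sample.pdf would write sample-img.pdf next to the original. Note that tiffg4 produces black-and-white output, so this suits text documents better than colour scans.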
