Ubuntu – How to turn a pdf into a text searchable pdf

ocr pdf software-recommendation

I have a number of scanned documents in pdf and I want to be able to search them. How can I do that?

Essentially, I have to OCR the pdf and then merge the extracted text back into a new pdf. I have unsuccessfully tried a number of different solutions (including the ones found in Adding OCR info to a PDF).

  1. pdfocr (which gives me this issue: https://github.com/gkovacs/pdfocr/issues/7)
  2. pdfsandwich (the Software Center flags it as a poor-quality package and advises against installing it)
  3. OCRfeeder (in the Software Center) exports to odt nicely, but does nothing when exporting to pdf.
  4. Gscan2pdf exports an all-black (but searchable) image, as reported in this discussion.
  5. I don't think Pdfxchange viewer can handle doing ocr on the fly on files over 500 pages.

Is there a software package I am unaware of? Or a script that does this?

Best Answer

As of Ubuntu 16.04, OCRmyPDF is available through apt. Just run

sudo apt install ocrmypdf
ocrmypdf -h   # to see the usage

You can then OCR your pdf with the command:

ocrmypdf input.pdf output.pdf  # change input and output to the files you want
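
Since the question mentions a number of scanned documents, the same command can be wrapped in a loop. The following is only a minimal sketch, not part of OCRmyPDF: the function name ocr_all and the OCR_CMD variable are made up for illustration, and it assumes all the scans sit in one directory and that the basic "ocrmypdf input output" form shown above is all that is needed.

```shell
#!/bin/sh
# Sketch: OCR every PDF in one directory into another directory.
# OCR_CMD is a hypothetical override (not an ocrmypdf feature); it defaults
# to ocrmypdf and exists only so the loop can be tried with another command.
ocr_all() {
    src_dir=$1
    out_dir=$2
    failed=0
    mkdir -p "$out_dir" || return 1
    for pdf in "$src_dir"/*.pdf; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$pdf" ] || continue
        out="$out_dir/$(basename "$pdf")"
        # Do not redo files that were already processed.
        if [ -e "$out" ]; then
            continue
        fi
        if ! "${OCR_CMD:-ocrmypdf}" "$pdf" "$out"; then
            echo "OCR failed for $pdf" >&2
            failed=1
        fi
    done
    # Non-zero if at least one file could not be processed.
    return "$failed"
}
```

After defining the function, something like ocr_all scans scans-ocr (both directory names are placeholders) would write one searchable copy per input file and leave the originals untouched.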

If the command seems unresponsive, you can increase the verbosity using the -v flag (which can be used incrementally as -vv or -vvv). It might be best to test the results first on a shorter pdf. You can shorten a pdf as follows:

pdftk A=input.pdf cat A1-5 output output.pdf
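
To check whether a trial run really produced a searchable file, one option is to see if any text can be extracted from it. This is a rough sketch that assumes pdftotext (from the poppler-utils package) is installed; the function name has_text_layer and the PDFTOTEXT variable are invented for this example. Note that it only shows that some text layer exists, not that the OCR result is accurate.

```shell
#!/bin/sh
# Sketch: succeed if a PDF yields any extractable text.
# PDFTOTEXT is a hypothetical override for this sketch; it defaults to
# pdftotext from poppler-utils, which is assumed to be installed.
has_text_layer() {
    # "-" sends the extracted text to stdout; grep -q only reports
    # whether at least one letter or digit was found.
    "${PDFTOTEXT:-pdftotext}" "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}
```

For example, has_text_layer output.pdf && echo "searchable" prints the message only when text was found. Running the scanned original through the same check should find nothing, which makes for a quick before-and-after comparison.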

If you have any questions, have a look at the new GitHub repo.