How to replace images of text in PDFs with formatted text using OCR

automation, documents, ocr, pdf

I get a lot of PDFs from other people consisting of scanned old documents. Unfortunately, sometimes the text on the scans, though legible, looks grainy and is hard to read.

What I've been able to do so far is extract the text, using OCR, into a Word document. However, since these old documents often have illustrations and intricate formatting, what I'd really like to do is remove the old grainy text and replace it with computer-generated fonts. In other words, I'd like to preserve the PDF and the formatting of its pages to the greatest extent possible while cleaning up the text by replacing it with, say, Times New Roman.

I've been looking online for a few days for a simple, automatable way to perform such a cleanup, and I haven't turned up anything so far. It seems like there should be a way to do this, since it doesn't seem that complicated, but maybe I'm overlooking some aspect of the problem that puts it beyond what is currently doable with OCR.

Any suggestions?

Best Answer

Even Adobe's own software is not good at doing this, or at making clear how to do it.

With Adobe Acrobat X, you can create a text layer through the menus (View | Tools | Recognize Text) or by clicking Tools in the toolbar and then Recognize Text in the Tools pane.

You then have options to perform OCR on the document or find "suspects". The "suspects" are possible OCR results that don't look right (perhaps words that fail a spell check?). Once you have gone through the suspects, there doesn't seem to be any way to access or edit the text layer again short of redoing the OCR.
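How Acrobat decides what counts as a suspect isn't something I know, but the idea can be approximated outside Acrobat. The sketch below assumes your OCR engine writes hOCR output with a per-word confidence in the `x_wconf` field (Tesseract does this) and flags every word below a threshold. The names `HocrWords` and `find_suspects`, the sample markup, and the threshold of 80 are illustrative choices, not anything Acrobat uses.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect text, bounding box and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            self._current = {
                "text": "",
                # (left, top, right, bottom) in image pixels
                "bbox": tuple(int(g) for g in bbox.groups()) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def find_suspects(hocr, min_conf=80):
    """Return the words whose reported confidence is below min_conf."""
    parser = HocrWords()
    parser.feed(hocr)
    return [w for w in parser.words
            if w["conf"] is not None and w["conf"] < min_conf]


# Made-up hOCR fragment for illustration only.
SAMPLE = """
<span class='ocr_line' title='bbox 36 92 580 116'>
 <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>The</span>
 <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 41'>qu1ck</span>
</span>
"""

for word in find_suspects(SAMPLE):
    print(word["text"], word["bbox"], word["conf"])
```

Because the flagged words keep their bounding boxes, you could review or correct them in place rather than redoing the OCR for the whole page.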

You can choose page ranges to limit OCR (e.g. if you have a multilingual document), but you can't limit it to a selection.

Given that this is such a useful feature, it's disappointing that Adobe don't make it very user-friendly.

Edit: Two other possible solutions.

Adobe Acrobat using ClearScan

When you perform OCR with Adobe Acrobat you can change the PDF Output Style from the default Searchable Image format to ClearScan. This format will actually change the image as well, replacing characters with outlines derived from the OCR. This would both make your PDF more readable and add a text layer, but it does change the original image.
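I don't know how ClearScan works internally, but if you wanted to script something similar yourself (cover each recognized word and redraw it in a clean font), the core of it is a coordinate conversion. The sketch below is only that geometry step, under stated assumptions: OCR boxes are in image pixels with the origin at the top left, the scan resolution is known, and the PDF uses points (72 per inch) with the origin at the bottom left. `place_word`, `TextPlacement` and the `size_factor` of 0.9 are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # PDF user space unit


@dataclass
class TextPlacement:
    text: str
    x: float          # left edge of the box, points from the left of the page
    y: float          # bottom edge of the box, points from the bottom of the page
    width: float      # box width in points
    height: float     # box height in points
    font_size: float  # rough font size in points


def place_word(text, bbox_px, dpi, page_height_px, size_factor=0.9):
    """Convert an OCR pixel box into a PDF-space box and a rough font size.

    bbox_px is (left, top, right, bottom) in pixels, origin at the top left.
    size_factor is a crude guess: a word box is not exactly one font size
    tall, since its height depends on which letters the word contains.
    """
    left, top, right, bottom = bbox_px
    scale = POINTS_PER_INCH / dpi
    width = (right - left) * scale
    height = (bottom - top) * scale
    x = left * scale
    # Flip the vertical axis: image y grows downward, PDF y grows upward.
    y = (page_height_px - bottom) * scale
    return TextPlacement(text, x, y, width, height, height * size_factor)


# Example: a word box on a US Letter page scanned at 300 dpi (3300 px tall).
placement = place_word("The", (300, 300, 600, 360), dpi=300, page_height_px=3300)
print(placement)
```

With a placement like this, a PDF library could paint a white rectangle over the box and draw the text at that position in Times New Roman. Expect to tune the size heuristic per document, and note that illustrations stay untouched only if the OCR boxes don't overlap them.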

Infix PDF Editor

This program does seem to be able to display the text layer, but it still seems tricky to fix places where Adobe's OCR goes wrong (e.g. lone words placed in their own positioned paragraph).

Sadly none of these options are freely available.
