How to use OCR from the command line in Linux

command line, ocr

I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations.

I need to create a list of all of the words appearing in each JPG file. Is there a command line tool for scanning an image and listing the words that appear? The scanning does not need to be perfect, just an estimate.

Best Answer

tesseract is probably the most-used solution here. It's available in most package repositories, e.g.,

sudo apt install tesseract-ocr

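and can be used with

tesseract input.png out

which writes the recognized text to out.txt (tesseract appends the .txt extension to the output base name itself).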
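
For several thousand pages, the same command can be wrapped in a loop that also reduces each page to a word list. The following is only a minimal sketch, assuming bash, scans named *.jpg in a single directory, and a tesseract version that accepts stdout as the output name. The function names words_from_text and ocr_word_lists are hypothetical helpers, not part of tesseract.

```shell
#!/usr/bin/env bash

# Turn text on stdin into a sorted list of unique lowercase words.
# Anything that is not a letter is treated as a word separator.
words_from_text() {
    tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | sort -u
}

# OCR every *.jpg in a directory (default: current directory) and
# write the word list for page.jpg to page.words.txt next to it.
ocr_word_lists() {
    local dir=${1:-.} img
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue   # no matching files: skip the literal pattern
        # Assumption: "stdout" as the output base makes tesseract print the text
        tesseract "$img" stdout 2>/dev/null | words_from_text > "${img%.jpg}.words.txt"
    done
}
```

Calling ocr_word_lists /path/to/scans would then leave one .words.txt file per page. Illustrations will likely produce some junk "words", which should be acceptable if only an estimate is needed.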

Related Solutions

Linux – How to use the ul command line utility

To underline a character, the input must contain either character-backspace-underscore or underscore-backspace-character. You also get boldface with character-backspace-character, i.e. the same character printed twice over itself.

echo $'hello k\b_i\b_t\b_t\b_y\b_ world' | ul

Less does a similar transformation automatically.
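
Typing the backspace sequences by hand gets tedious, so they can be generated instead. A minimal sketch, assuming bash or another shell that supports local; the function name underline is a hypothetical helper, not part of ul:

```shell
#!/usr/bin/env bash

# Print the argument with backspace-underscore appended to every
# character, which is the overstrike form that ul reads as underlining.
underline() {
    local bs
    bs=$(printf '\b')                       # a literal backspace character
    printf '%s\n' "$1" | sed "s/./&${bs}_/g"
}
```

It could then be used as echo "hello $(underline kitty) world" | ul. Note that this sketch underlines spaces inside the argument as well.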

How to invoke an Openoffice macro from the Linux command line

The flag you want is -invisible. See this example, adapted from http://ubuntuforums.org/showthread.php?t=786697

ooffice -invisible 'macro:///Standard.Module1.SaveAsXHTML("/tmp/somefile.rtf")'
