How to use OCR from the command line in Linux

command line, ocr

I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations.

I need to create a list of all of the words appearing in each JPG file. Is there a command line tool that scans an image and lists the words that appear in it? The recognition does not need to be perfect; an estimate is enough.

Best Answer

tesseract is probably the most-used solution here. It's available in most package repositories, e.g.,

sudo apt install tesseract-ocr

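and can be used as follows. The second argument is an output base name to which tesseract appends .txt, so this command writes out.txt:

tesseract input.jpg out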
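
For the several thousand pages in the question, a loop can produce one word list per image. The following is only a minimal sketch, assuming the files end in .jpg and sit in the current directory. The helper names words_from_text and ocr_words and the .words.txt suffix are illustrative and not part of tesseract. It relies on tesseract accepting stdout as the output base, which prints the recognized text instead of writing a file.

```shell
#!/bin/sh
# Turn OCR text on stdin into a sorted list of unique lowercase words.
# (Illustrative helper; anything that is not a letter separates words.)
words_from_text() {
    tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | sort -u
}

# OCR one image and print its word list. Using "stdout" as the output
# base makes tesseract print the text rather than create a .txt file.
ocr_words() {
    tesseract "$1" stdout 2>/dev/null | words_from_text
}

# Write page.words.txt next to each page.jpg in the current directory.
for img in *.jpg; do
    [ -e "$img" ] || continue   # no matches: the glob stays literal, so skip
    ocr_words "$img" > "${img%.jpg}.words.txt"
done
```

Pictures and illustrations usually yield some stray fragments in the output, which should be acceptable when only an estimate is needed. Depending on the locale, accented letters may be treated as separators by tr.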