Tesseract: is it possible to change font output in OCRed pdf

evincefontsocrpdftesseract

Following up on how to OCR a pdf file and get the text stored within pdf? I have successfully produced OCRed pdf pages.

In Evince, however, the letters are not shown; by this I mean that I cannot see the characters, but I can select them, copy them and paste them elsewhere successfully. This does not seem to be a bug of Evince: https://bugzilla.redhat.com/show_bug.cgi?id=1364201

When initiating an OCR of a pdf page with pdfsandwich, tesseract produces a page that

contains a font which doesn't have any
usable glyphs (they named it GlyphLessFont). It has only .notdef and
.null replacements (the squares). Evince uses the .notdef glyph if there
is no glyph for the character. The reason that Okular highlight the text
is because it does it in the image not as a regular text as evince does.

pdftotext recognises the characters.

Now, the question is: can tesseract be told to use a different font?

Best Answer

You could customise this portion of the source code to your liking and change the font here. You will have to rebuild tesseract from source once you make the change.

Tesseract Github Renderer.h

0. Make a backup of your original PDF file

'nuff said...

1. Use `qpdf` to un-compress most of the PDF objects

qpdf is a beautiful command line tool to transform most PDFs into a form that makes it easier to manipulate through a text editor (or through sed):

qpdf                       \
  --qdf                    \
  --object-streams=disable \
    input.pdf              \
    editable.pdf

2. Search for spots where PDF code contains `3 Tr`

All spots in the editable.pdf where there is 'invisible' (a.k.a. neither filled nor stroked) text is marked by an initial definition of

3 Tr

Change these to now read

1 Tr

This should make the previously hidden text visible. Glyphs will appear in thick outlines, overlaying the original scanned page images.

It will look very ugly.

Save the edited PDF.

3. Change `Tj` and `TJ` text stroking operators to 'no-ops'

Whenever a text string is prepared for being rendered, the actual operator that is responsible for doing so is named Tj or TJ.

Look out for all of these. Replace them by tJ and tj. This will change them into 'no-ops': they have no meaning at all in the PDF source code; no PDF viewer or processor will "understand" them. (Be careful not to change the number of bytes when replacing stuff in PDF source code, because otherwise you may cause it to become "corrupted".)

Save the PDF file.

4. Check how the PDF file looks now

The PDF should now look "clean" again. The renamed text operators do not have any meaning any more for the PDF viewer, nor for any PDF interpreter.

5. Use Ghostscript to create the final PDF

This command should achieve what you want:

gs                        \
  -o final.pdf            \
  -sDEVICE=pdfwrite       \
  -dPDFSETTINGS=/prepress \
   editable.pdf

This final step uses editable.pdf as input. It outputs final.pdf. The output will have removed all traces of text. The input still had the text, albeit in an "unusable" form, because the operator renaming. Since Ghostscript does not "understand" the re-named operators, it will simply skip them by default.

Libvte unicode font handling

Actually, xterm (and xfd) can use TrueType fonts, using the -fa option. With the latter, you can see that 0xee20 is missing from the font:

Given that, xterm would show a missing glyph (because it uses only one font). Likely vte is doing the same because none of the fallback fonts which it may be using has this particular private use code.

For comparison, here is a screenshot with xfd displaying the PCF font (which does have the glyph):

Whether the fonts are "identical" depends on how they were created (and maintained). I see 7 missing glyphs on the screenshot of the TrueType font, and none on the PCF font.

Best Answer

Related Solutions

How to convert a scanned PDF with OCRed text to one without OCRed text

0. Make a backup of your original PDF file

1. Use qpdf to un-compress most of the PDF objects

2. Search for spots where PDF code contains 3 Tr

3. Change Tj and TJ text stroking operators to 'no-ops'