I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and the file came with an automatically generated OCR text layer. How can I remove the text layer in order to get a lighter file (and to get rid of the unnecessary OCR)?
Delete OCR from PDF
ocr, pdf
Related Solutions
See the pdfseparate and pdfunite commands from poppler-utils. The first separates the pages of a document into individual files; the second merges them, in the order you want, into a new document.
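A minimal sketch of that split-and-merge workflow (the file names are just examples):

```shell
# Split scan.pdf into single-page files: page-1.pdf, page-2.pdf, ...
pdfseparate scan.pdf page-%d.pdf

# Merge selected pages, in the order given, into a new document
pdfunite page-2.pdf page-1.pdf page-3.pdf reordered.pdf
```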
Also note that since scanners give you raster images anyway (which some, like yours, can concatenate into a PDF file), maybe you can configure it to output images (PNG, TIFF, ...) instead, and do the concatenation into a PDF yourself with ImageMagick.
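If you do get individual images out of the scanner, the ImageMagick step could look like this (a sketch; the file names are assumptions, and for large scans you may want to tune resolution and compression options):

```shell
# Concatenate scanned page images, in shell-glob order, into one PDF
convert page-*.png scans.pdf

# On ImageMagick 7 the preferred entry point is 'magick':
# magick page-*.png scans.pdf
```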
Here is how I would remove the OCR-ed text should I have to...
First, you need to know that OCR-ed text in a PDF is not a layer, but text drawn with a special text rendering mode. The official PDF specification lists all available text rendering modes; mode 3 renders text invisibly (neither filled nor stroked), which is what OCR text uses.
For more background, please see my related answers on Stack Overflow.
Now for the procedure I envisage:
0. Make a backup of your original PDF file
'nuff said...
1. Use qpdf to un-compress most of the PDF objects
qpdf is a beautiful command line tool that transforms most PDFs into a form which is easier to manipulate with a text editor (or with sed):
qpdf \
--qdf \
--object-streams=disable \
input.pdf \
editable.pdf
2. Search for spots where the PDF code contains 3 Tr

All spots in editable.pdf where there is 'invisible' text (i.e. text that is neither filled nor stroked) are marked by an initial setting of 3 Tr. Change these to read 1 Tr.
This should make the previously hidden text visible. Glyphs will appear in thick outlines, overlaying the original scanned page images.
It will look very ugly.
Save the edited PDF.
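Since the replacement has the same byte length as the original, this search-and-replace can be scripted with sed; a sketch (note that a blanket substitution could in principle also match '3 Tr' inside a string literal, so inspect the result):

```shell
# Demo on a snippet of PDF content-stream code; on the real file you would run
#   sed -i 's/3 Tr/1 Tr/g' editable.pdf
# '1 Tr' has the same byte length as '3 Tr', so file offsets stay valid.
echo 'BT 3 Tr 1 0 0 1 72 720 Tm (hello) Tj ET' | sed 's/3 Tr/1 Tr/g'
```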
3. Change the Tj and TJ text-showing operators to 'no-ops'

Whenever a text string is prepared for rendering, the operator actually responsible for doing so is named Tj or TJ. Look out for all of these and replace them by tJ and tj, respectively. This turns them into 'no-ops': they have no meaning at all in PDF source code, and no PDF viewer or processor will "understand" them. (Be careful not to change the number of bytes when replacing stuff in PDF source code, because otherwise you may cause the file to become "corrupted".)
Save the PDF file.
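This renaming, too, keeps the byte count unchanged, so it can again be scripted with sed; a sketch (as before, a blanket substitution could match inside string literals, so check the output):

```shell
# Demo on a content-stream snippet; on the real file you would run
#   sed -i -e 's/ Tj/ tJ/g' -e 's/ TJ/ tj/g' editable.pdf
# Each replacement keeps the same byte length, so file offsets stay valid.
echo 'BT (hello) Tj [(wor) -20 (ld)] TJ ET' | sed -e 's/ Tj/ tJ/g' -e 's/ TJ/ tj/g'
```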
4. Check how the PDF file looks now
The PDF should now look "clean" again: the renamed text operators no longer mean anything to a PDF viewer or to any other PDF interpreter.
5. Use Ghostscript to create the final PDF
This command should achieve what you want:
gs \
-o final.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
editable.pdf
This final step uses editable.pdf as input and outputs final.pdf. The output will have all traces of text removed. The input still contained the text, albeit in an "unusable" form because of the operator renaming: since Ghostscript does not "understand" the renamed operators, it simply skips them by default.
Best Answer
The command given by @dirkt didn't work for me; it did decrease the file size from 560 MB to 300-and-some MB, but I didn't check with diffpdf, so I don't know what changed between the files.
What worked for me is Apache PDFBox. The PDFBox developers provide a nice little program among their examples to remove text (and to do other things), but since I don't have any experience with Java (or anything except bash, for that matter), what I did was install openjdk-11-jdk-headless and libpdfbox-java.
Steps:
1. jar xf pdfbox2.jar
2. The RemoveAllText example lives under org/apache/pdfbox/examples/util.
3. javac org/apache/pdfbox/examples/util/RemoveAllText.java
4. java org.apache.pdfbox.examples.util.RemoveAllText

If someone comes across this answer and knows a better way to do this, please comment.
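Put together, the sequence might look like this (a sketch: the jar path is an assumption for a Debian-like libpdfbox-java install, it presumes the jar contains the example's .java source, and the input/output file names are placeholders; PDFBox's RemoveAllText example expects an input and an output file):

```shell
# Unpack the PDFBox jar so the example source becomes available locally
jar xf /usr/share/java/pdfbox2.jar

# Compile the example against the PDFBox classes
javac -cp /usr/share/java/pdfbox2.jar \
    org/apache/pdfbox/examples/util/RemoveAllText.java

# Run it: strips all text from input.pdf and writes output.pdf
java -cp .:/usr/share/java/pdfbox2.jar \
    org.apache.pdfbox.examples.util.RemoveAllText input.pdf output.pdf
```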