Delete OCR from PDF

ocr pdf

I have a PDF file containing a corrupted OCR layer. It is a set of handwritten pages with a lot of symbols and abbreviations, and the file came with automatically generated OCR. How can I remove the text layer to get a lighter file (and to get rid of the unnecessary OCR)?

Best Answer

The command given by @dirkt didn't work for me. It did reduce the file size from 560 MB to a little over 300 MB, but I didn't check with diffpdf, so I don't know what changed between the files.

What worked for me is Apache PDFBox. The PDFBox developers provide a nice little program in their examples for removing text (among other things). Since I don't have any experience with Java (or anything except bash, for that matter), all I did was install openjdk-11-jdk-headless and libpdfbox-java.

Steps:

  1. Copy pdfbox2.jar, fontbox2.jar and commons-logging.jar (needed by some class in pdfbox2) to a folder.
  2. Extract the jar files in that folder, e.g. jar xf pdfbox2.jar.
  3. Get the PDFBox source for the same version as the one installed.
  4. Copy RemoveAllText.java from the source to the folder org/apache/pdfbox/examples/util (create it if it does not exist).
  5. Compile it: javac org/apache/pdfbox/examples/util/RemoveAllText.java
  6. Now you can run it; without arguments it prints its usage: java org.apache.pdfbox.examples.util.RemoveAllText
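
The steps above can be sketched as a small shell script. This is only a sketch under assumptions: the default JAR_DIR is where Debian/Ubuntu packages usually put jars, PDFBOX_SRC and WORK_DIR are hypothetical locations you would set yourself, the function names are made up for this example, and the path to RemoveAllText.java inside the source tree may differ between PDFBox versions.

```shell
#!/usr/bin/env bash
# Sketch only: every default path below is an assumption, adjust to your system.

JAR_DIR="${JAR_DIR:-/usr/share/java}"          # assumed location of the packaged jars
PDFBOX_SRC="${PDFBOX_SRC:-$HOME/pdfbox-src}"   # hypothetical: unpacked PDFBox source, same version
WORK_DIR="${WORK_DIR:-$HOME/removetext}"       # hypothetical scratch folder
EXAMPLE_PKG="org/apache/pdfbox/examples/util"  # package path from steps 4 and 5

# Steps 1 to 5. Runs in a subshell so that 'set -e' and 'cd' do not leak out.
build_remove_all_text() (
    set -e
    mkdir -p "$WORK_DIR"
    cd "$WORK_DIR"

    # Steps 1 and 2: copy the jars, then unpack them in place.
    for f in pdfbox2.jar fontbox2.jar commons-logging.jar; do
        cp "$JAR_DIR/$f" .
        jar xf "$f"
    done

    # Steps 3 and 4: put the example where its package declaration expects it.
    # The location inside the source tree is an assumption.
    mkdir -p "$EXAMPLE_PKG"
    cp "$PDFBOX_SRC/examples/src/main/java/$EXAMPLE_PKG/RemoveAllText.java" "$EXAMPLE_PKG/"

    # Step 5: compile against the unpacked classes in the current folder.
    javac -cp . "$EXAMPLE_PKG/RemoveAllText.java"
)

# Step 6: run the compiled example; with no arguments it prints its usage.
remove_all_text() {
    java -cp "$WORK_DIR" org.apache.pdfbox.examples.util.RemoveAllText "$@"
}
```

After sourcing the script, build_remove_all_text only needs to run once; remove_all_text in.pdf out.pdf then passes both file names on to the example (in.pdf and out.pdf are placeholders). It is worth opening the result in a viewer to confirm that only the text layer is gone.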

If someone comes across this answer and knows a better way to do this, please comment.
