I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and the file came with an automatically generated OCR text layer. How can I remove the text layer in order to get a lighter file (and to get rid of the unnecessary OCR)?
Delete OCR from PDF
ocr, pdf
Related Solutions
See the pdfseparate and pdfunite commands from poppler-utils. The first separates the pages of a document into individual files; the second merges them, in the order you want, into a new document.
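A minimal sketch of that split-and-merge workflow (the file names are just examples):

```shell
# Split scan.pdf into single-page files: page-1.pdf, page-2.pdf, ...
pdfseparate scan.pdf page-%d.pdf

# Merge selected pages, in the order given, into a new document
pdfunite page-2.pdf page-1.pdf page-3.pdf reordered.pdf
```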
Also note that since scanners give you raster images anyway (which some, like yours, can concatenate into a PDF file), maybe you can configure it to output images (PNG, TIFF, ...) instead, and do the concatenation into a PDF yourself with ImageMagick.
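If you do get individual images out of the scanner, the ImageMagick step could look like this (a sketch; the file names are assumptions, and for large scans you may want to tune resolution and compression options):

```shell
# Concatenate scanned page images, in shell-glob order, into one PDF
convert page-*.png scans.pdf

# On ImageMagick 7 the preferred entry point is 'magick':
# magick page-*.png scans.pdf
```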
Here is how I would remove the OCR-ed text should I have to...
First, you need to know that OCR-ed text in a PDF is not a layer, but text drawn with a special text rendering mode. The official PDF specification lists all available text rendering modes; mode 3 renders text invisibly (neither filled nor stroked), which is what OCR text uses.
For more background, please see my related answers on Stack Overflow.
Now for the procedure I envisage:
0. Make a backup of your original PDF file
'nuff said...
1. Use qpdf to un-compress most of the PDF objects
qpdf is a beautiful command line tool that transforms most PDFs into a form which is easier to manipulate with a text editor (or with sed):
qpdf \
--qdf \
--object-streams=disable \
input.pdf \
editable.pdf
2. Search for spots where the PDF code contains 3 Tr

All spots in editable.pdf where there is 'invisible' text (i.e. text that is neither filled nor stroked) are marked by an initial setting of 3 Tr. Change these to read 1 Tr.
This should make the previously hidden text visible. Glyphs will appear in thick outlines, overlaying the original scanned page images.
It will look very ugly.
Save the edited PDF.
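Since the replacement has the same byte length as the original, this search-and-replace can be scripted with sed; a sketch (note that a blanket substitution could in principle also match '3 Tr' inside a string literal, so inspect the result):

```shell
# Demo on a snippet of PDF content-stream code; on the real file you would run
#   sed -i 's/3 Tr/1 Tr/g' editable.pdf
# '1 Tr' has the same byte length as '3 Tr', so file offsets stay valid.
echo 'BT 3 Tr 1 0 0 1 72 720 Tm (hello) Tj ET' | sed 's/3 Tr/1 Tr/g'
```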
3. Change the Tj and TJ text-showing operators to 'no-ops'

Whenever a text string is prepared for rendering, the operator actually responsible for doing so is named Tj or TJ. Look out for all of these and replace them by tJ and tj, respectively. This turns them into 'no-ops': they have no meaning at all in PDF source code, and no PDF viewer or processor will "understand" them. (Be careful not to change the number of bytes when replacing stuff in PDF source code, because otherwise you may cause the file to become "corrupted".)
Save the PDF file.
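This renaming, too, keeps the byte count unchanged, so it can again be scripted with sed; a sketch (as before, a blanket substitution could match inside string literals, so check the output):

```shell
# Demo on a content-stream snippet; on the real file you would run
#   sed -i -e 's/ Tj/ tJ/g' -e 's/ TJ/ tj/g' editable.pdf
# Each replacement keeps the same byte length, so file offsets stay valid.
echo 'BT (hello) Tj [(wor) -20 (ld)] TJ ET' | sed -e 's/ Tj/ tJ/g' -e 's/ TJ/ tj/g'
```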
4. Check how the PDF file looks now
The PDF should now look "clean" again: the renamed text operators no longer mean anything to a PDF viewer or to any other PDF interpreter.
5. Use Ghostscript to create the final PDF
This command should achieve what you want:
gs \
-o final.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
editable.pdf
This final step uses editable.pdf as input and outputs final.pdf. The output will have all traces of text removed. The input still contained the text, albeit in an "unusable" form because of the operator renaming: since Ghostscript does not "understand" the renamed operators, it simply skips them by default.
Best Answer
The command given by @dirkt didn't work for me; it did decrease the file size from 560 MB to 300-and-some MB, but I didn't check with diffpdf, so I don't know what changed between the files.
What worked for me is Apache PDFBox. The PDFBox developers provide a nice little program among their examples to remove text (and to do other things), but since I don't have any experience with Java (or anything except bash, for that matter), what I did was install openjdk-11-jdk-headless and libpdfbox-java.
Steps:
1. jar xf pdfbox2.jar
2. The RemoveAllText example lives under org/apache/pdfbox/examples/util.
3. javac org/apache/pdfbox/examples/util/RemoveAllText.java
4. java org.apache.pdfbox.examples.util.RemoveAllText

If someone comes across this answer and knows a better way to do this, please comment.
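Put together, the sequence might look like this (a sketch: the jar path is an assumption for a Debian-like libpdfbox-java install, it presumes the jar contains the example's .java source, and the input/output file names are placeholders; PDFBox's RemoveAllText example expects an input and an output file):

```shell
# Unpack the PDFBox jar so the example source becomes available locally
jar xf /usr/share/java/pdfbox2.jar

# Compile the example against the PDFBox classes
javac -cp /usr/share/java/pdfbox2.jar \
    org/apache/pdfbox/examples/util/RemoveAllText.java

# Run it: strips all text from input.pdf and writes output.pdf
java -cp .:/usr/share/java/pdfbox2.jar \
    org.apache.pdfbox.examples.util.RemoveAllText input.pdf output.pdf
```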