My PDF contains 600 pages with images of text.
It has 2 layers.
- Layer 1: Background colour image
- Layer 2: Text image
I would like to remove the background image layer from every page of the PDF file, as shown in the image.
Could you suggest any software or tool for this?
Best Answer
Overview
What you are looking for are tools like ScanTailor and unpaper, which are capable of thresholding, despeckling, and noise removal. Both tools work with images rather than PDF files, but you can easily convert between PDF and the formats these applications use with the tools described at the end of this answer.
ScanTailor
You can find a video tutorial here. More extensive documentation is available on the official wiki. You will probably be most interested in the page on black and white output mode and filter settings.
Unpaper
I haven't worked with unpaper myself yet. From what I understand, it has far more features than ScanTailor, but it's also much harder to master. There is no GUI, so you will have to rely on command-line switches to get your work done. On the other hand, this means that conversions with unpaper can easily be automated using scripts. You can find some scripting examples concerning converting a scan to black and white and removing the background here.
Some helpful tools when working with unpaper and ScanTailor
I don't have enough time to write up a full tutorial on ScanTailor and unpaper¹, but here are some pointers on converting between .pdf and the image formats supported by these tools:
You can use pdfimages to convert PDF documents to single-page .ppm files, which can be read by unpaper. Usage example:
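A minimal sketch of such a call, wrapped in a shell function so it can be reused in a script. The function name pdf_to_images and the file names scan.pdf and page are placeholders of mine, not part of pdfimages itself. Note that pdfimages writes one file per embedded image, so a page that stores its background and its text as two separate images will produce two files.

```shell
# Extract the embedded images of a PDF into numbered files that all
# start with the given prefix.
# Arguments: $1 = input PDF, $2 = output file prefix (both placeholders).
pdf_to_images() {
    pdfimages "$1" "$2"
}

# Example call (hypothetical file names):
#   pdf_to_images scan.pdf page
```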
ScanTailor doesn't take .ppm files as input. You will have to convert them to another format first, such as the lossless .png. mogrify, from the ImageMagick tool suite, can do this for you. Usage example:
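One way this conversion might look, again as a small shell function. The name ppm_to_png and the directory argument are placeholders of mine. With -format, mogrify writes new .png files next to the originals rather than overwriting them.

```shell
# Convert every .ppm file in a directory to lossless .png.
# Argument: $1 = directory holding the .ppm files (placeholder).
ppm_to_png() {
    mogrify -format png "$1"/*.ppm
}

# Example call (hypothetical directory name):
#   ppm_to_png ./pages
```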
ScanTailor and unpaper both output single-page .tiff files. To convert them back to .pdf, I would suggest using tiffcp and tiff2pdf. Usage example:
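A sketch of the two steps combined, assuming the single-page files end in .tif and that their names sort in page order (for example, zero-padded numbers). The function name tiffs_to_pdf and all file names are placeholders of mine. The output PDF should live outside the input directory so that the intermediate multi-page TIFF is not picked up on a second run.

```shell
# Merge single-page TIFF files into one multi-page TIFF, then turn
# that file into a PDF.
# Arguments: $1 = directory with the .tif files, $2 = output PDF.
tiffs_to_pdf() {
    merged="${2%.pdf}.tif"   # intermediate multi-page TIFF next to the PDF
    tiffcp "$1"/*.tif "$merged" &&
        tiff2pdf -o "$2" "$merged"
}

# Example call (hypothetical names):
#   tiffs_to_pdf ./out ./result/book.pdf
```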
Installation
This command will install all of the tools mentioned above:
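As a sketch, assuming a Debian or Ubuntu system with apt: the package names below are my assumptions (poppler-utils for pdfimages, libtiff-tools for tiffcp and tiff2pdf), and scantailor may be missing from the repositories of some releases. The function name install_scan_tools is a placeholder.

```shell
# Install the tools mentioned above on a Debian/Ubuntu system.
# Package names are assumptions; check your distribution's repositories.
install_scan_tools() {
    sudo apt-get install poppler-utils imagemagick libtiff-tools scantailor unpaper
}

# Example call:
#   install_scan_tools
```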
¹: To anyone reading this, please feel free to compile a more extensive answer based on ScanTailor and/or unpaper.