Get printer-ready black text on white background in scanned pdf files (remove grayscale or color background)

image processingpdfsoftware-recommendation

How can I turn photos of paper documents into a scanned document? is related, but not the same, as I'm talking about pdf files. The processing of images seems complicated in the answers under the linked question, especially because it involves processing each image separately: given my pdf has hundreds of pages, the solution I expect is not that of processing/editing images, but simply of scanning digital photos and documents the way real ones are. I mean something like a "virtual scanner" for which the input would be a photo-based pdf or collection of photos and the output a "normal" scanned document. (Also the Scantailor tool recommended – also here – seems to lack a Linux version now.)


This is not about OCR and not about converting image to text.

To clarify what I mean I will post a few examples.

There are pdf files based on text, not image, and they are text files (let's say docx or odt) exported to pdf. They look ready to be printed:

enter image description here

The above is not what I discuss here.

What I'm interested in are the pdfs in the images below, namely the difference between scanned text pages that look too much like images and scanned text pages that look like digitized text.

The first are formed of images that look like pictures taken of book pages:

enter image description here

or

enter image description here

Such copies can hardly be re-printed on paper, as the background will be printed too.

The second ones are what one would expect from scanned text, and can be printed:

enter image description here

or

enter image description here

The picture-like pdf may already be OCR-processed and its text searchable, and still look like a collection of (page) photos: OCR is not the problem here.

What I want is the clear black-on-white look of the "scanned" pdf and the removal of all the "real" details (especially shadows) that are normal in a photo but should be absent in a printed page.


As @vanadium noticed in a comment, I am looking for a software solution that automatically cleans up pictures of a document, much alike Google Scan on a smartphone.

As @user535733 said in a comment, the problem here seems to be, at least to some extent, that of converting the greyscale (scanned/image) text to black-and-white.

Best Answer

scantailor is not maintained anymore but you can still build it from source and use it.

However, the original repository needs qt4, which is not easily installable in recent Ubuntu versions. You can use e.g. this fork that has adapted to qt5.

Prerequisites:

sudo apt install libjpeg-dev zlib1g-dev libpng-dev libtiff-dev libboost-dev libxrender-dev libboost-all-dev

Installation:

git clone https://github.com/victl/scantailor
cd scantailor
cmake .
make
sudo make install

Disclaimer: I don't know the maintainer of this fork, and cannot say anything about the safety of his version.


Another option would be to use Scantailor advanced. You can install it via snap ...

sudo snap install scantailor-advanced

... or flatpak.

... or via ppa.

sudo add-apt-repository ppa:alex-p/scantailor
sudo apt update
sudo apt install scantailor # or scantailor-advanced

Quick test:

enter image description here