Get printer-ready black text on white background in scanned pdf files (remove grayscale or color background)

image processing, pdf, software-recommendation

The question "How can I turn photos of paper documents into a scanned document?" is related but not the same, since I'm talking about pdf files. The image processing described in the answers under that question seems complicated, especially because each image has to be processed separately. My pdf has hundreds of pages, so the solution I expect is not editing images one by one, but simply "scanning" digital photos and documents the way paper ones are scanned. I mean something like a "virtual scanner" whose input would be a photo-based pdf or a collection of photos and whose output would be a "normal" scanned document. (Also, the Scantailor tool recommended there, and also here, seems to lack a Linux version now.)

This is not about OCR and not about converting image to text.

To clarify what I mean I will post a few examples.

There are pdf files based on text, not image, and they are text files (let's say docx or odt) exported to pdf. They look ready to be printed:

The above is not what I discuss here.

What I'm interested in are the pdfs in the images below, namely the difference between scanned text pages that look too much like images and scanned text pages that look like digitized text.

The first are formed of images that look like pictures taken of book pages:

Such copies can hardly be re-printed on paper, as the background will be printed too.

The second ones are what one would expect from scanned text, and can be printed:

The picture-like pdf may already be OCR-processed and its text searchable, and still look like a collection of (page) photos: OCR is not the problem here.

What I want is the clear black-on-white look of the "scanned" pdf and the removal of all the "real" details (especially shadows) that are normal in a photo but should be absent in a printed page.
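
One common way to get rid of shadows like these is to estimate the paper brightness around each pixel and divide by it, so that the paper ends up near white everywhere while the ink stays dark. The following is only a minimal sketch of that idea in plain Python, not the method used by any tool mentioned on this page. The function name `flatten_background`, the `radius` parameter, and the representation of a page as a list of rows of grey levels (0 to 255) are assumptions made for the illustration.

```python
def flatten_background(page, radius=2):
    """Lighten shadows by dividing each pixel by a local paper estimate.

    `page` is a list of rows, each row a list of grey levels (0-255).
    The paper brightness around a pixel is approximated by the brightest
    value found in a (2*radius+1) x (2*radius+1) window around it.
    """
    height, width = len(page), len(page[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            # Brightest neighbour stands in for "what the paper looks like here".
            background = max(page[j][i] for j in ys for i in xs)
            background = max(background, 1)  # avoid division by zero
            row.append(min(255, round(255 * page[y][x] / background)))
        result.append(row)
    return result
```

The window has to be wider than the pen strokes, otherwise the brightest neighbour of an ink pixel is itself ink and the text is washed out together with the shadow. This pure-Python loop is slow on full-size pages; it is meant to show the principle only.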

As @vanadium noticed in a comment, I am looking for a software solution that automatically cleans up pictures of a document, much like Google Scan on a smartphone.

As @user535733 said in a comment, the problem here seems to be, at least to some extent, that of converting the greyscale (scanned/image) text to black-and-white.
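
To make the greyscale-to-black-and-white step concrete, here is a minimal sketch in plain Python that picks a cut-off with Otsu's method (choosing the grey level that maximizes the variance between the "ink" and "paper" classes) and then maps every pixel to pure black or pure white. It is an illustration under stated assumptions, not what Scantailor or any other tool named here does internally: the function names `otsu_threshold` and `binarize` and the list-of-rows page representation are made up for the example, and a real workflow would first have to render each pdf page to an image.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best splits dark from light pixels."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    dark_count = 0   # pixels with grey level <= current candidate
    dark_sum = 0     # sum of their grey levels
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        if dark_count == 0:
            continue
        light_count = total - dark_count
        if light_count == 0:
            break
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        # Between-class variance (up to a constant factor).
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(page):
    """Map a page (list of rows of grey levels) to pure black (0) / white (255)."""
    threshold = otsu_threshold([value for row in page for value in row])
    return [[0 if value <= threshold else 255 for value in row] for row in page]
```

A single global cut-off like this only works when the lighting is even; on photographed pages with shadows the background usually has to be evened out first, or the threshold has to be computed per region.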

Best Answer

scantailor is no longer maintained, but you can still build it from source and use it.

However, the original repository needs Qt4, which is not easily installable on recent Ubuntu versions. You can use, for example, this fork, which has been adapted to Qt5.

Prerequisites:

sudo apt install libjpeg-dev zlib1g-dev libpng-dev libtiff-dev libboost-dev libxrender-dev libboost-all-dev

Installation:

git clone https://github.com/victl/scantailor
cd scantailor
cmake .
make
sudo make install

Disclaimer: I don't know the maintainer of this fork, and cannot say anything about the safety of his version.

Another option would be to use Scantailor advanced. You can install it via snap ...

sudo snap install scantailor-advanced

... or flatpak.

... or via ppa.

sudo add-apt-repository ppa:alex-p/scantailor
sudo apt update
sudo apt install scantailor # or scantailor-advanced

Quick test:

Related Solutions

Ubuntu – Remove text information from a PDF

Openoffice

Install the PDF Import Extension from Oracle through the Extension Manager in OpenOffice, and you will be able to open and edit your PDF files inside OpenOffice Draw. Draw will create all the elements (text, lines, drawings, etc.), and you will be able to remove those you don't want.

Gimp

If you prefer to handle your pdf pages as layers and edit them as images, right-click the PDF file and choose "Open with GIMP Image Editor". Once GIMP has opened, the "Import from PDF" dialog appears and lets you choose which pages you wish to edit, along with several import options.

After that, you will be able to edit those pages in GIMP.

Good luck!

Ubuntu – How to edit text in a scanned .jpeg

To make text in a .jpeg editable you need Optical Character Recognition (OCR) software. I use ocrfeeder.

sudo apt-get install ocrfeeder

To open an image file, click on the 'plus' (+) sign. After you have opened the image, click on the next icon to the right to run OCR.

After it has finished OCR'ing the image, you can select the text you want on the left, and copy it out on the right.

The easiest way to get the text out is to just copy it over to LibreOffice. With a little editing, my copy looks very similar.

After you make the required changes, you can export the result as a .pdf by clicking 'export as pdf' in the LibreOffice toolbar.

Ultimately, it's best to scan to .pdf if you can. If you can't, this works very well.

NOTE: OCR is not 100% accurate; you may have to correct errors, and the more formatting your document has, the harder it will be.
