Ubuntu – How to remove the gray-scale page background of a PDF document scan while preserving the text? (Binarization)

image processing, pdf, software-recommendation

My PDF contains 600 pages with images of text.
It has 2 layers.

  • Layer 1: Background colour image

  • Layer 2: Text image

I would like to remove the background image layer from every page of the PDF file, as shown in the image.

Could you suggest any software or tool for this?

Best Answer

Overview

What you are looking for are tools like Scan Tailor and unpaper, which can do thresholding, despeckling, and noise removal. Both tools work with images rather than PDF files, but you can easily convert between PDF and the image formats they use with the tools described at the end of this answer.
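
To make the term concrete, here is a minimal sketch of what thresholding does: every pixel darker than a chosen level becomes black and everything else becomes white. It is only an illustration, not how Scan Tailor or unpaper work internally (they use more elaborate methods). The function name threshold_pgm is made up for this example, and it assumes a plain-text (P2) PGM file without comment lines.

```shell
# threshold_pgm LEVEL < in.pgm > out.pbm
# Reads a plain-text (P2) PGM image on stdin and writes a plain-text (P1) PBM.
# Hypothetical helper for illustration; assumes the PGM has no comment lines.
threshold_pgm() {
    awk -v level="$1" '
        # Collect every whitespace-separated token of the file.
        { for (i = 1; i <= NF; i++) tok[n++] = $i }
        END {
            # tok[0] is "P2", tok[1] the width, tok[2] the height, tok[3] the max value.
            print "P1"
            print tok[1], tok[2]
            for (k = 4; k < n; k++) {
                # In PBM, 1 means black: pixels darker than the level become ink.
                bit = (tok[k] + 0 < level + 0) ? 1 : 0
                # End the line after each full row of pixels.
                sep = ((k - 3) % tok[1] == 0) ? "\n" : " "
                printf "%d%s", bit, sep
            }
        }'
}
```

For example, feeding it a 3x2 image with the pixel rows "10 200 30" and "250 40 240" and a level of 128 gives the rows "1 0 1" and "0 1 0": the dark text pixels survive and the light gray background turns white.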

ScanTailor

You can find a video tutorial here. More extensive documentation is available on the official wiki. You will probably be most interested in the page on black and white output mode and filter settings.

Unpaper

I haven't worked with unpaper myself yet. From what I understand it has far more features than ScanTailor, but it's also much harder to master.

There is no GUI, so you will have to rely on command line switches to get your work done. On the other hand, this means that conversions with unpaper can easily be automated using scripts.

You can find some scripting examples concerning converting a scan to black and white and removing the background here.


Some helpful tools when working with unpaper and ScanTailor

I don't have enough time to write up a full tutorial on ScanTailor and unpaper¹ but here are some pointers concerning converting between .pdf and the image formats supported by these tools:

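  • You can use pdfimages to extract the page images of a PDF document as single page .ppm files, which can be read by unpaper. It takes one PDF file followed by a prefix for the output file names.

    Usage example (input.pdf stands for your document):

    pdfimages input.pdf ./extracted-images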
    
  • ScanTailor doesn't take .ppm files as input. You will have to convert them to another format first, such as the lossless .png. mogrify from the ImageMagick tool suite can do this for you.

    Usage example:

    mogrify -format png *.ppm
    
  • ScanTailor writes single page .tiff files. unpaper writes PNM files (.pbm, .pgm or .ppm), which mogrify can convert to .tiff in the same way. To combine the .tiff files into a .pdf, I would suggest using tiffcp and tiff2pdf.

    Usage example:

    tiffcp *.tiff all.tiff
    tiff2pdf -F -p A4 -z -o Document.pdf all.tiff
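
    The steps above can be chained into one script. The following is a rough sketch rather than a tested recipe: the function name clean_pdf and the temporary directory layout are made up for this example, unpaper is run with its default settings (which you will probably need to tune for your scan), and it assumes each page of the PDF is a single image. A layered PDF yields several images per page, which you would have to filter first. It also relies on the shell listing the files in page order, which holds as long as pdfimages writes fewer than 1000 images.

```shell
# clean_pdf INPUT.pdf OUTPUT.pdf
# Sketch only: extract page images, clean them with unpaper, rebuild a PDF.
clean_pdf() {
    in_pdf=$1
    out_pdf=$2
    work=$(mktemp -d) || return 1          # scratch directory (assumption of this sketch)
    mkdir "$work/raw" "$work/clean" || return 1

    # 1. Extract the images; pdfimages names them page-000.ppm, page-001.ppm, ...
    pdfimages "$in_pdf" "$work/raw/page" || return 1

    # 2. Clean every extracted image with unpaper (input file, then output file).
    for f in "$work"/raw/page-*.p[bgp]m; do
        [ -e "$f" ] || continue
        unpaper "$f" "$work/clean/$(basename "$f")" || return 1
    done

    # 3. Convert the cleaned PNM files to TIFF and merge them into one PDF.
    mogrify -format tiff "$work"/clean/*.p[bgp]m || return 1
    tiffcp "$work"/clean/*.tiff "$work/all.tiff" || return 1
    tiff2pdf -F -p A4 -z -o "$out_pdf" "$work/all.tiff" || return 1

    # Remove the scratch directory once the PDF has been written.
    rm -r "$work"
}
```

    After defining the function, a call such as clean_pdf input.pdf Document.pdf would run the whole round trip. If a step fails, the function stops and leaves the scratch directory in place so you can inspect the intermediate files.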
    

Installation

This command will install all of the tools mentioned above:

sudo apt-get install scantailor unpaper poppler-utils imagemagick libtiff-tools

¹: To anyone reading this, please feel free to compile a more extensive answer based on ScanTailor and/or unpaper.