Ubuntu – How to convert a scanned PDF into a PDF with text

pdf

I have scanned about 80 pages into a grayscale PDF (image-based).
The resulting file is about 70 MB, which is very large.

Now I am looking for a method to convert the grayscale image-based PDF file into a simple black/white text-based PDF file.

I have made many attempts with gs (Ghostscript), but without success (only a few percent recovery).
If anyone has an idea, kindly let me know.

Best Answer

gImageReader is a simple GTK+ front-end to tesseract-ocr.

sudo apt-get install gimagereader tesseract-ocr

Sorry for the German text in the screenshot.
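
Since gImageReader is a front-end to tesseract, the same engine can also be driven from a terminal. The following is only a minimal sketch: it assumes each scanned page is already available as an image file (for example saved straight from the scanner), and the helper name page_to_pdf and the file name in the usage line are made up for illustration. Languages other than English need the matching tesseract language package installed.

```shell
#!/bin/sh
# Sketch only: page_to_pdf is a hypothetical helper, not part of tesseract.
page_to_pdf() {
    image=$1            # one scanned page as an image, e.g. page-01.png
    lang=${2:-eng}      # tesseract language code; defaults to English
    # The trailing "pdf" asks tesseract to write <outputbase>.pdf,
    # i.e. the page image with a searchable text layer.
    tesseract "$image" "${image%.*}" -l "$lang" pdf
}

# Run only when arguments are given, e.g.: sh page_to_pdf.sh page-01.png deu
if [ "$#" -ge 1 ]; then
    page_to_pdf "$@"
fi
```

Called as page_to_pdf page-01.png, this would write page-01.pdf next to the image.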

Related Solutions

Ubuntu – How to edit a picture into an existing PDF file

My recommendation is Xournal and its actively developed fork, Xournal++. Here are the instructions.

Install (for Xournal):

sudo apt-get install xournal

For Xournal++ you can use either the stable PPA,

sudo add-apt-repository ppa:apandada1/xournalpp-stable
sudo apt update
sudo apt install xournalpp

or the flatpak,

flatpak install flathub com.github.xournalpp.xournalpp

Run xournal or xournalpp, click File > Annotate PDF, and choose your PDF file.

Now go to where you need to add your signature and click Tools > Image (or the "Image" toolbar icon), then click where you want to add the image. An image selection dialog appears; select your image.

Xournal's image insertion is a great addition, but it is not polished. As soon as you add the image, resize it and move it to where you want it. Resizing cannot lock the aspect ratio, so you have to judge the proportions by eye. Once you are done, the image sits in its own layer, which you cannot change. If you don't like how it ends up, delete that layer and start again.

One handy thing is that you can press Ctrl+C as soon as you have resized the image and then Ctrl+V the next time you need to insert it. Assuming you want the image at the same size, this will save you some time.

When you are done, choose File > Export to PDF to get the document back into PDF format, which I assume is what you'll want for sending your signed document.

Note: A downside to Xournal is that the fonts in the finished document look as if they were converted to an image, so they are no longer as crisp. Still, it looks better than if you had printed and rescanned, and it is much faster. [Note: in my most recent experience this problem seems to have been solved. Maybe I just got lucky with the particular fonts used. Please leave a comment about your experience and I'll update accordingly.] This issue seems to be fixed in Xournal++ version 1.0.20.

Ubuntu – How to turn a pdf into a text searchable pdf

As of Ubuntu 16.04, OCRmyPDF is available through apt. Just run:

sudo apt install ocrmypdf
ocrmypdf -h   # to see the usage

Finally you can OCR your pdf with the command:

ocrmypdf input.pdf output.pdf  # change input and output to the files you want
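
If you have several scans, the same command can be wrapped in a loop. This is a rough sketch rather than a recommended workflow: the helper name ocr_dir, the "-ocr.pdf" suffix and the directory in the usage line are assumptions made for this example. It passes -l to pick the OCR language and --deskew to straighten crooked pages; a language other than eng needs the matching tesseract language package.

```shell
#!/bin/sh
# Sketch only: ocr_dir and the "-ocr.pdf" naming are hypothetical choices.
ocr_dir() {
    dir=$1              # directory holding the scanned PDFs
    lang=${2:-eng}      # OCR language passed to ocrmypdf -l
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                 # pattern matched nothing
        case $f in *-ocr.pdf) continue ;; esac  # skip results of earlier runs
        # Write the searchable copy next to the original file.
        ocrmypdf -l "$lang" --deskew "$f" "${f%.pdf}-ocr.pdf" \
            || echo "failed: $f" >&2
    done
}

# Run only when arguments are given, e.g.: sh ocr_dir.sh ~/scans deu
if [ "$#" -ge 1 ]; then
    ocr_dir "$@"
fi
```

The originals are left untouched, so a bad OCR run can simply be deleted and repeated.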

If it seems the command is unresponsive, you can increase the verbosity using the -v flag (which can be used incrementally as -vv or -vvv). It might be best to test the results first on a shorter pdf. You can shorten a pdf as follows:

pdftk A=input.pdf cat A1-5 output output.pdf
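
The two steps (cut a short sample, then OCR it verbosely) can be combined into one small helper. Treat this as a sketch under assumptions: ocr_sample and the sample-N.pdf file names are invented here, and pdftk will report an error if the input has fewer pages than requested.

```shell
#!/bin/sh
# Sketch only: ocr_sample and the sample-*.pdf names are hypothetical.
ocr_sample() {
    input=$1
    pages=${2:-5}                   # number of leading pages to try
    sample="sample-$pages.pdf"
    # Cut the first N pages out of the input, using the pdftk call shown above.
    pdftk A="$input" cat "A1-$pages" output "$sample" || return 1
    # OCR just the sample; -v prints progress so a slow run is not
    # mistaken for a hang.
    ocrmypdf -v "$sample" "sample-$pages-ocr.pdf"
}

# Run only when arguments are given, e.g.: sh ocr_sample.sh input.pdf 5
if [ "$#" -ge 1 ]; then
    ocr_sample "$@"
fi
```

If the sample looks good, run ocrmypdf on the full file.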

If you have any questions, have a look at the new GitHub repo.
