Ubuntu – How to produce a multi-page sandwich pdf with hocr2pdf


Starting from a multi-page TIFF, I used tesseract to produce the hOCR HTML that hocr2pdf takes as input.

I then tried using hocr2pdf to produce a "sandwich PDF" (image + hidden text layer).

However, hocr2pdf produces a one-page PDF with all the pages superimposed.

Is there a way to solve this problem or an alternative solution?

Best Answer

I found a workaround for this issue. hocr2pdf has trouble producing multi-page PDFs, so I split the input into single-page TIFFs, ran tesseract and then hocr2pdf on each page, and combined the resulting PDFs with pdftk using the following script:

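# For each single-page TIFF: OCR it to hOCR (-l fra = French), then build a one-page sandwich PDF.
for f in ./*.tif; do
   # Writes "$f.html" on older Tesseract releases; newer ones write "$f.hocr" instead.
   tesseract "$f" "$f" -l fra hocr
   hocr2pdf -i "$f" -s -o "$f.pdf" < "$f.html"
done
# Merge the per-page PDFs in glob order, then remove the intermediate files.
pdftk *.tif.pdf cat output "output.pdf" && rm *.tif.pdf && rm *.tif.html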
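
The answer does not show how the single-page TIFFs were produced. One possible way, sketched here under the assumption that the tiffsplit tool from libtiff is installed; the function name split_tiff, the default directory "pages" and the "page_" prefix are made up for illustration:

```shell
# Hypothetical helper: split a multi-page TIFF into one file per page.
# tiffsplit names its outputs <prefix>aaa.tif, <prefix>aab.tif, ... so a
# plain glob such as ./*.tif later returns the pages in document order.
split_tiff() {
    input=$1
    outdir=${2:-pages}          # assumed default output directory
    mkdir -p "$outdir" || return 1
    tiffsplit "$input" "$outdir/page_"
}
```

For example, running split_tiff scan.tif pages and then changing into the pages directory should leave one TIFF per page there, ready for the loop above.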

Related Solutions

Ubuntu – How to edit a picture into an existing PDF file

My recommendation is Xournal and its actively developed fork, Xournal++. Here are the instructions.

Install (for Xournal):

sudo apt-get install xournal

For Xournal++ you can use either the stable PPA,

sudo add-apt-repository ppa:apandada1/xournalpp-stable
sudo apt update
sudo apt install xournalpp

or the flatpak,

flatpak install flathub com.github.xournalpp.xournalpp

Run Xournal or Xournal++, click File > Annotate PDF, and choose your PDF file.

Now go to where you need to add your signature and click Tools > Image (or the "Image" toolbar icon), then click where you want to place the image. An image selection dialog appears; select your image.

Xournal's image insertion is a welcome addition, but it is not polished. As soon as you add the image, resize it and move it to where you want it. There is no way to lock the proportions while resizing, so judge it by eye. Once placed, the image sits in its own layer, which you cannot change. If you don't like the result, delete that layer and start again.

One handy trick: press Ctrl+C right after resizing the image, then press Ctrl+V the next time you need to insert it. If you want the image at the same size each time, this saves some time.

When you are done, choose File > Export to PDF to get the document back into PDF, which is presumably the format you want for sending your signed document.

Note: A downside of Xournal is that the fonts in the finished document look as if they were converted to an image and are no longer as crisp. It still looks better than printing and rescanning, and it is much faster. [Note: in my most recent experience this problem seems to have been solved, though maybe I just got lucky with the particular fonts used. Please leave a comment about your experience and I'll update accordingly.] This issue seems to be fixed in Xournal++ version 1.0.20.

Ubuntu – Convert pdf to monochrome black-and-white via command line

Adapting an answer from Super User, this can be achieved by converting the PDF to PostScript and then back to PDF with a redefined setrgbcolor command:

gs -o <output-file.pdf> -sDEVICE=pdfwrite \
-c "/osetrgbcolor {/setrgbcolor} bind def /setrgbcolor {pop [0 0 0] osetrgbcolor} def" \
-f <input-file.ps>
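
The command above takes a PostScript file as input, so the PDF has to be converted first. A minimal sketch of that step, assuming the pdf2ps wrapper that ships with Ghostscript is available; the function name pdf_to_ps is made up for illustration:

```shell
# Hypothetical helper: convert a PDF to PostScript with Ghostscript's pdf2ps.
# If no output name is given, reuse the input name with a .ps extension.
pdf_to_ps() {
    input=$1
    output=${2:-${input%.pdf}.ps}
    pdf2ps "$input" "$output"
}
```

For example, pdf_to_ps input.pdf would write input.ps, which can then be passed to the gs command in place of <input-file.ps>.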
