Ubuntu – How to produce a multi-page sandwich pdf with hocr2pdf

ocr pdf

I used tesseract to produce the hOCR HTML file that hocr2pdf takes as input, starting from a multi-page TIFF.

I then tried using hocr2pdf to produce a "sandwich PDF" (page image plus a hidden text layer).

Instead, hocr2pdf produces a one-page PDF with all the pages superimposed.

Is there a way to solve this problem or an alternative solution?

Best Answer

I found a workaround. hocr2pdf has trouble producing multi-page PDFs, so I split the document into single-page TIFFs, ran tesseract and then hocr2pdf on each page, and combined the per-page PDFs with pdftk using the following script:

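for f in ./*.tif; do
   # OCR one page (-l fra selects French); older tesseract releases write "$f.html"
   tesseract "$f" "$f" -l fra hocr
   # Put the hOCR text behind the page image as a hidden text layer
   hocr2pdf -i "$f" -s -o "$f.pdf" < "$f.html"
done
# Merge the pages in glob order (zero-pad page numbers in file names), then clean up
pdftk ./*.tif.pdf cat output "output.pdf" && rm ./*.tif.pdf ./*.tif.html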
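
The script above assumes the single-page TIFFs already exist. A minimal sketch of the whole workaround, including the splitting step, could look like the function below. The function name make_sandwich_pdf and the use of tiffsplit (from the libtiff tools) for splitting are my assumptions, not part of the original answer; any tool that writes one TIFF per page with sortable names would do. It also assumes that newer tesseract releases write "$page.hocr" while older ones write "$page.html", so it checks for both.

```shell
# Sketch: split a multi-page TIFF, OCR each page, merge the per-page PDFs.
# Assumes tiffsplit, tesseract, hocr2pdf and pdftk are installed.
# make_sandwich_pdf is a hypothetical helper name used for illustration.
make_sandwich_pdf() {
    input="$1"          # multi-page TIFF
    output="$2"         # resulting sandwich PDF
    lang="${3:-fra}"    # tesseract language code (the answer used French)

    workdir=$(mktemp -d) || return 1

    # tiffsplit writes one file per page: page_aaa.tif, page_aab.tif, ...
    # These names keep their page order under shell globbing.
    tiffsplit "$input" "$workdir/page_" || return 1

    for page in "$workdir"/page_*.tif; do
        tesseract "$page" "$page" -l "$lang" hocr || return 1

        # Pick whichever hOCR file this tesseract version produced.
        hocr="$page.hocr"
        [ -f "$hocr" ] || hocr="$page.html"

        # Same hocr2pdf invocation as in the answer, one page at a time.
        hocr2pdf -i "$page" -s -o "$page.pdf" < "$hocr" || return 1
    done

    # Concatenate the per-page PDFs in page order.
    pdftk "$workdir"/page_*.tif.pdf cat output "$output" || return 1

    # Remove intermediate files only after a successful merge;
    # on failure the work directory is left in place for inspection.
    rm -rf "$workdir"
}
```

Usage would be something like: make_sandwich_pdf scan.tif output.pdf fra (scan.tif is a placeholder file name). Working in a temporary directory keeps the intermediate .tif, hOCR and .pdf files out of the current directory, so the cleanup cannot delete unrelated files.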