Copy PDF text layer to another PDF

Suppose you've got two "scanned" PDF files:

1. Large, but without a text layer.
2. Smaller (with lower-quality images), but with a correct text layer.

Both files contain the same page images; they differ only in compression.

The goal is to embed the text layer of the 2nd PDF into the 1st PDF.

"Just OCR the 1st file" is not a solution. I know Acrobat (and some other tools) can OCR without altering the image layer, but I'm not happy with their OCR quality.

So, I see two possible ways:

1. Export and import the text layer somehow.
2. Replace the images in the image layer somehow.

Concerning the 1st way, I've found nothing.
Concerning the 2nd way, I've found two tools that come quite close, hocr2pdf and pdf2text, but as far as I understand they are still not enough. 🙁

PS: Use case:

I've just found another case where such an operation is systematically useful.

If you've got a scanned PDF, pdf-1 (without a text layer), with, say, JPEG image compression, ABBYY FineReader gives you an OCR'd PDF, pdf-2. It will be either quite large, if you choose lossless image compression, or its image quality will be significantly lower than that of pdf-1. In many cases, the best choice is to keep the source image compression as-is and not recompress the images.

Best Answer

Here's a simple shell script to do this on the command-line:

#!/usr/bin/env bash

set -eu

# Usage: pdf_merge_text TEXT_PDF IMAGE_PDF [OUTPUT_PDF]
# OUTPUT_PDF defaults to "-" (standard output).
pdf_merge_text() {
    local txtpdf; txtpdf="$1"
    local imgpdf; imgpdf="$2"
    local outpdf; outpdf="${3--}"
    if [ "-" != "${txtpdf}" ] && [ ! -f "${txtpdf}" ]; then echo "error: text PDF does not exist: ${txtpdf}" 1>&2; return 1; fi
    if [ "-" != "${imgpdf}" ] && [ ! -f "${imgpdf}" ]; then echo "error: image PDF does not exist: ${imgpdf}" 1>&2; return 1; fi
    if [ "-" != "${outpdf}" ] && [ -e "${outpdf}" ]; then echo "error: not overwriting existing output file: ${outpdf}" 1>&2; return 1; fi
    (
        # Temporary file in the current directory, removed when the subshell exits.
        local txtonlypdf; txtonlypdf="$(TMPDIR=. mktemp --suffix=.pdf)"
        trap "rm -f -- '${txtonlypdf//'/'\\''}'" EXIT
        # Drop the images from the text PDF, keeping only its text layer.
        gs -o "${txtonlypdf}" -sDEVICE=pdfwrite -dFILTERIMAGE "${txtpdf}"
        # Combine page by page: each page of the image PDF is stamped onto the matching text-only page.
        pdftk "${txtonlypdf}" multistamp "${imgpdf}" output "${outpdf}"
    )
}

pdf_merge_text "$@"
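To try the script, it may help to first confirm that the external programs it calls are installed. The sketch below is not part of the answer: `missing_tools` is a hypothetical helper, `pdf-merge-text.sh` is an assumed filename for the saved script, and the PDF names in the comments are placeholders. The final check assumes poppler's `pdftotext` is available.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not from the answer): print every command name
# passed as an argument that cannot be found on PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "${tool}" >/dev/null 2>&1; then
            printf '%s\n' "${tool}"
        fi
    done
}

# Example usage (assumed script name, placeholder file names):
#   missing_tools gs pdftk pdftotext   # prints nothing if all three are installed
#   ./pdf-merge-text.sh small-with-text.pdf large-scan.pdf merged.pdf
#   pdftotext merged.pdf - | head      # should show text copied from the small PDF
```

The argument order follows the script: the PDF with the text layer first, the high-quality image PDF second, and the output file last. The script refuses to overwrite an existing output file, so `merged.pdf` must not exist yet.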
