Copy PDF text layer to another PDF

adobe-acrobat ocr pdf

Suppose you have two "scanned" PDF files:

  1. Large, but without a text layer.
  2. Smaller (with lower-quality images), but with a correct text layer.

Both files contain the same images, differing only in their compression.

The goal is to embed the text layer of the second PDF into the first.

"Just OCR the first file" is not a solution. I know Acrobat (and some other tools) can OCR without altering the image layer, but I'm not happy with their OCR quality.

So, I see two possible ways:

  1. Export and import the text layer somehow.
  2. Replace the images in the image layer somehow.

For the first, I've found nothing.
For the second, I've found two tools that come quite close, hocr2pdf and pdf2text, but as far as I understand they are still not enough. 🙁

PS: Use case:

I've just found another case where this operation is systematically useful.

If you have a scanned pdf-1 (without a text layer) with, say, JPEG image compression, ABBYY FineReader gives you an OCR'd PDF, pdf-2. It will either be quite large, if you choose lossless image compression, or have significantly lower image quality than pdf-1. In many cases the best choice is to keep the source image compression as is and not recompress the images.

Best Answer

Here's a simple shell script to do this on the command line.

Save this as ~/pdf-merge-text.sh (and chmod +x it):

#!/usr/bin/env bash

set -eu

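pdf_merge_text() {
    local txtpdf; txtpdf="$1"        # PDF with the good text layer
    local imgpdf; imgpdf="$2"        # PDF with the good images
    local outpdf; outpdf="${3--}"    # output file; defaults to "-" (stdout)
    if [ "-" != "${txtpdf}" ] && [ ! -f "${txtpdf}" ]; then echo "error: text PDF does not exist: ${txtpdf}" 1>&2; return 1; fi
    if [ "-" != "${imgpdf}" ] && [ ! -f "${imgpdf}" ]; then echo "error: image PDF does not exist: ${imgpdf}" 1>&2; return 1; fi
    if [ "-" != "${outpdf}" ] && [ -e "${outpdf}" ]; then echo "error: not overwriting existing output file: ${outpdf}" 1>&2; return 1; fi
    (
        # Temporary file for the text-only copy; removed when the subshell exits.
        local txtonlypdf; txtonlypdf="$(TMPDIR=. mktemp --suffix=.pdf)"
        trap "rm -f -- '${txtonlypdf//'/'\\''}'" EXIT
        # Strip all images from the OCR'd PDF, keeping only its text layer.
        # Ghostscript's messages are sent to stderr so they cannot mix into the PDF when the output is "-".
        gs -o "${txtonlypdf}" -sDEVICE=pdfwrite -dFILTERIMAGE "${txtpdf}" 1>&2
        # Stamp each page of the image PDF onto the matching page of the text-only PDF.
        pdftk "${txtonlypdf}" multistamp "${imgpdf}" output "${outpdf}"
    )
}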

pdf_merge_text "$@"

Now just call it:

~/pdf-merge-text.sh txt.pdf img.pdf out.pdf
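
Since pdftk's multistamp pairs page N of one file with page N of the other, it may be worth checking first that both PDFs have the same number of pages. A minimal sketch, assuming pdfinfo (from poppler-utils) is installed; the helper names pages_from_pdfinfo and check_same_pages are made up for illustration and are not part of the script above:

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a pre-merge sanity check.

# Print the page count found in `pdfinfo` output read from stdin.
pages_from_pdfinfo() {
    awk '/^Pages:/ { print $2; exit }'
}

# Succeed only if both PDFs report the same page count.
# Assumes `pdfinfo` (poppler-utils) is on PATH.
check_same_pages() {
    local a b
    a="$(pdfinfo "$1" | pages_from_pdfinfo)"
    b="$(pdfinfo "$2" | pages_from_pdfinfo)"
    if [ -z "${a}" ] || [ "${a}" != "${b}" ]; then
        echo "error: page counts differ: $1 has ${a:-?}, $2 has ${b:-?}" 1>&2
        return 1
    fi
}
```

With these functions sourced into your shell, the merge can be guarded like this: check_same_pages txt.pdf img.pdf && ~/pdf-merge-text.sh txt.pdf img.pdf out.pdf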

The idea is to strip the images from the OCR'd PDF, then merge the result with the image PDF via the technique in the answer above (pdftk's multistamp).
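
To confirm that the text layer actually made it into the result, one option is to extract the text and check that it is not empty. A rough sketch, assuming pdftotext (from poppler-utils) is installed; has_text is a hypothetical helper name, and the check only shows that some text exists, not that it is positioned correctly:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if stdin contains at least one
# non-whitespace character (form feeds between pages count as whitespace).
has_text() {
    grep -q '[^[:space:]]'
}

# Example use, assuming `pdftotext` (poppler-utils) is on PATH:
#   pdftotext out.pdf - | has_text && echo "text layer present"
```

Running the same check on img.pdf should fail, since that file has no text layer to begin with.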