Avoid bloating file size when removing OCRed text from a pdf file

imagemagick pdf pdftk

I used pdfimages and convert, as recommended by Anthon, to remove the OCRed text from a PDF file, and the file size grew from 29 MB to 373 MB.

My first step is to split the PDF file into one pbm file per page:

mkdir tmp1
pdfimages ull.pdf tmp1/ull

The generated pbm files total 788 MB.

In my next step, I convert the generated pbm files and combine them into a single PDF file:

cd tmp1
convert ull*.pbm all.pdf

This fails, however, because it requires more than 1 GB of space on /tmp, and my /tmp doesn't have that much free. So my second step is actually:

mkdir tmp2
for i in ull-*.pbm; do convert "$i" "tmp2/$i.pdf"; done
cd tmp2
pdftk ull-???.pbm.pdf ull-????.pbm.pdf cat output ../../all.pdf

The generated file all.pdf is 373 MB, much larger than the original 29 MB.
I ran pdftk all.pdf output new.pdf compress, but that didn't reduce the file size.

Since all I want is to remove the OCRed text from the PDF file, how can I avoid bloating the file size?

Best Answer

If the original images are JPEG files, you could use the pdfimages option -j. From man pdfimages:

-j     Normally, all images are written as PBM (for monochrome  images)
       or  PPM  (for  non-monochrome  images) files.  With this option,
       images in DCT format are  saved  as  JPEG  files.   All  non-DCT
       images are saved in PBM/PPM format as usual.
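
As a rough sketch of how the -j option could slot into the first step of the question, the helper below creates the output directory and runs pdfimages with -j. The function name extract_images and its argument order are made up for illustration; only the pdfimages call itself is taken from the man page excerpt.

```shell
# Hypothetical helper: extract the images of a PDF, keeping DCT (JPEG)
# streams as JPEG files instead of expanding them to PBM/PPM.
# Usage: extract_images INPUT.pdf OUTDIR PREFIX
extract_images() {
    pdf=$1
    outdir=$2
    prefix=$3
    mkdir -p "$outdir" || return 1
    # -j: write images stored in DCT format as JPEG files
    pdfimages -j "$pdf" "$outdir/$prefix"
}

# Example call (commented out so the snippet can be sourced safely):
# extract_images ull.pdf tmp1 ull
```

Images that are not stored in DCT format still come out as PBM/PPM, so this only helps when the scans are JPEG-encoded inside the PDF.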

I am not sure how to control the way convert stores the images in the PDF file, but you can use -quality to alter the compression quality and -resize to reduce the pixel dimensions.
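
A minimal sketch of passing those two options when converting a single page, assuming convert is ImageMagick's. The function name shrink_page and the default values 50% and 75 are placeholders chosen for illustration, not recommendations. Whether -quality has a visible effect depends on the compression convert picks for the PDF, which this answer leaves open.

```shell
# Hypothetical helper: turn one page image into a one-page PDF while
# lowering the pixel dimensions and the compression quality.
# Usage: shrink_page INPUT_IMAGE OUTPUT.pdf [SCALE] [QUALITY]
shrink_page() {
    src=$1
    dst=$2
    scale=${3:-50%}     # assumed default, tune per document
    quality=${4:-75}    # assumed default, tune per document
    convert "$src" -resize "$scale" -quality "$quality" "$dst"
}

# Example call (commented out so the snippet can be sourced safely):
# shrink_page ull-000.pbm ull-000.pdf 50% 75
```

Comparing the size of one converted page against the unmodified conversion is a quick way to see whether the settings help before processing the whole document.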

By calling convert in one of the following ways

TMPDIR=/home/tim/tmp convert ...
MAGICK_TMPDIR=/home/tim/tmp convert ...

you can have convert use /home/tim/tmp as its temporary directory and so circumvent the space problem. (This probably has no influence on the resulting file size.)
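
The same idea can be wrapped in a small function so the directory is created first and both variables are set for the one command only. This is just a sketch: the name run_with_tmpdir is hypothetical, and /home/tim/tmp in the example call is the path used above.

```shell
# Hypothetical helper: run any command with the temporary directory
# redirected to a location that has enough free space.
# Usage: run_with_tmpdir TMP_LOCATION COMMAND [ARGS...]
run_with_tmpdir() {
    tmp_location=$1
    shift
    mkdir -p "$tmp_location" || return 1
    # Set both variables, covering the two variants of the call.
    TMPDIR="$tmp_location" MAGICK_TMPDIR="$tmp_location" "$@"
}

# Example call (commented out so the snippet can be sourced safely):
# run_with_tmpdir /home/tim/tmp convert ull*.pbm all.pdf
```

Because the assignments are placed in front of the command, they apply only to that invocation and leave the rest of the shell session unchanged.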
