I use pdfimages and convert, as recommended by Anthon, to remove the OCRed text from a PDF file, but the size of the file grows from 29 MB to 373 MB.
My first step is to split the PDF file into one PBM file per page:
mkdir tmp1
pdfimages ull.pdf tmp1/ull
The total size of the generated PBM files is 788 MB.
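As an aside (assuming a poppler pdfimages recent enough to support the -list option), you can check how the embedded images are encoded before extracting anything:

pdfimages -list ull.pdf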
In my next step, I convert and combine the generated PBM files into a single PDF file:
cd tmp1
convert ull*.pbm all.pdf
This goes wrong, however, because it requires more than 1 GB of space on /tmp, and my /tmp doesn't have that much free space. So my second step is actually:
mkdir tmp2
for i in ull-*.pbm; do convert "$i" tmp2/"$i".pdf; done
cd tmp2
pdftk ull-???.pbm.pdf ull-????.pbm.pdf cat output ../../all.pdf
(Listing the three-digit page numbers before the four-digit ones keeps the pages in numerical order.)
The generated PDF file all.pdf is 373 MB, much larger than the original 29 MB.
I run pdftk all.pdf output new.pdf compress, but it doesn't reduce the file size.
Since all I want is to remove the OCRed text from the PDF file, how can I avoid this file-size bloat?
Best Answer
If the original images are JPEG files, you could use pdfimages' -j option. From man pdfimages: normally all images are written as PBM (for monochrome images) or PPM (for non-monochrome images) files; with -j, images in DCT format are saved as JPEG files instead.
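With the directory layout used in the question, that would be (a sketch; pages whose images are not stored in DCT format still come out as PBM/PPM):

pdfimages -j ull.pdf tmp1/ull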
I am not sure how to control the way convert stores the images in the PDF file, but you can use -quality and -resize to alter the compression quality.
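For example, plugged into the per-page loop from the question (the values are only illustrative; suitable ones depend on your scans):

for i in ull-*.pbm; do convert -resize 50% -quality 40 "$i" tmp2/"$i".pdf; done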
By calling convert in one of the following ways, you can have convert use /home/tim/tmp as the temporary directory and circumvent the space problem (which probably has no influence on the resulting file size).
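A sketch of two such calls, assuming ImageMagick's standard temporary-directory mechanisms (the MAGICK_TMPDIR environment variable, or the registry:temporary-path setting):

MAGICK_TMPDIR=/home/tim/tmp convert ull*.pbm all.pdf
convert -define registry:temporary-path=/home/tim/tmp ull*.pbm all.pdf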