Avoid bloating file size when removing OCRed text from a pdf file

imagemagick, pdf, pdftk

I used pdfimages and convert, as recommended by Anthon, to remove the OCRed text from a PDF file, and the file grew from 29 MB to 373 MB.

My first step is to split the PDF file into one PBM file per page:

mkdir tmp1
pdfimages ull.pdf tmp1/ull

The total size of the generated PBM files is 788 MB.

In my next step, I convert the generated PBM files and combine them into a single PDF file:

cd tmp1
convert ull*.pbm all.pdf

This fails, however, because it requires more than 1 GB of space in /tmp, and my /tmp doesn't have that much free space. So my second step is actually:

mkdir tmp2
for i in ull-*.pbm; do convert "$i" "tmp2/$i.pdf"; done
cd tmp2
pdftk ull-???.pbm.pdf ull-????.pbm.pdf cat output ../../all.pdf

The generated file all.pdf is 373 MB, much larger than the original 29 MB.
I ran pdftk all.pdf output new.pdf compress, but that did not reduce the file size.

Since all I want is to remove OCRed text from the pdf file, how can I avoid the file size bloating?

Best Answer

If the original images are JPEG files, you can use the pdfimages option -j. From man pdfimages:

-j     Normally, all images are written as PBM (for monochrome  images)
       or  PPM  (for  non-monochrome  images) files.  With this option,
       images in DCT format are  saved  as  JPEG  files.   All  non-DCT
       images are saved in PBM/PPM format as usual.
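
As a rough sketch of how the -j option could be used, the helper below wraps the extraction step. The function name extract_images and the output prefix "page" are made up for illustration; only the -j flag comes from the man page excerpt above.

```shell
# Hypothetical helper (name and "page" prefix are illustrative, not from the question).
# Extracts the images of a PDF, keeping DCT (JPEG) streams as .jpg files.
extract_images() {
    src=$1      # input PDF
    outdir=$2   # directory that receives the extracted images
    mkdir -p "$outdir" || return 1
    # -j: write DCT-encoded images as JPEG instead of decoding them to PPM.
    # Non-DCT images still come out as PBM/PPM, as the man page says.
    pdfimages -j "$src" "$outdir/page"
}

# Example call (commented out so the snippet only defines the helper):
# extract_images ull.pdf tmp1
```

If the scans are monochrome and not stored as DCT, this will still produce PBM files, so it only helps when the embedded images are JPEGs.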

I am not sure how to control the way convert stores the images in the PDF file, but you can use -quality to change the compression quality and -resize to reduce the pixel dimensions.

By calling convert in one of the following ways

TMPDIR=/home/tim/tmp  convert ...
MAGICK_TMPDIR=/home/tim/tmp convert ...

you can have convert use /home/tim/tmp as the temporary directory and circumvent the space problems. This probably has no influence on the resulting file size.
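
Putting the temporary-directory setting together with the per-page loop from the question might look like the sketch below. The function name pbm_to_pdfs and its two arguments are hypothetical. The -compress Group4 option is not from the answer above: it is ImageMagick's setting for CCITT Group 4 compression of bilevel images, added here as an assumption about what may keep monochrome pages small, and whether it works depends on how your ImageMagick was built.

```shell
# Hypothetical helper: convert every PBM in the current directory to a one-page PDF.
pbm_to_pdfs() {
    outdir=$1    # where the one-page PDFs are written
    scratch=$2   # scratch directory with enough free space
    mkdir -p "$outdir" "$scratch" || return 1
    for f in ./*.pbm; do
        [ -e "$f" ] || continue   # skip the literal pattern if nothing matched
        # MAGICK_TMPDIR redirects ImageMagick's temporary files for this call.
        # -compress Group4 (assumption, see above) requests bilevel fax compression.
        MAGICK_TMPDIR=$scratch convert "$f" -compress Group4 \
            "$outdir/$(basename "$f" .pbm).pdf" || return 1
    done
}

# Example call from inside tmp1 (commented out so the snippet only defines the helper):
# pbm_to_pdfs tmp2 /home/tim/tmp
```

The resulting one-page files can then be joined with pdftk as in the question. Comparing the size of a single converted page with and without the -compress option is a quick way to see whether it helps for your scans.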

Related Solutions

Generate a hyperlinked table of contents and insert into existing PDF

This is taken in whole from @Herbert answering a very similar question on the TeX StackExchange:

Adding Table of Contents to existing PDF

Use the pdfpages package, and then:

\documentclass{article}
\usepackage{pdfpages}
\usepackage{hyperref}

\begin{document}

\tableofcontents
\clearpage\phantomsection
\addcontentsline{toc}{section}{The first section name}% or chapter
\includepdf[pages={1-10},linktodoc,linktodocfit=/Fit]{texte/dtk/dtk11-1/komoedie.pdf}
\clearpage\phantomsection
\addcontentsline{toc}{section}{The second section name}% or chapter
\includepdf[pages={11-19},linktodoc,linktodocfit=/Fit]{texte/dtk/dtk11-1/komoedie.pdf}
\clearpage\phantomsection
\addcontentsline{toc}{section}{The third section name}% or chapter
\includepdf[pages={20-29},linktodoc,linktodocfit=/Fit]{texte/dtk/dtk11-1/komoedie.pdf}
\clearpage\phantomsection
\addcontentsline{toc}{section}{The fourth section name}% or chapter
\includepdf[pages={30-39},linktodoc,linktodocfit=/Fit]{texte/dtk/dtk11-1/komoedie.pdf}

\end{document}

Combine multiple PDF files into one (arranged in a matrix)

You could use the utility program pdfnup from the pdfjam suite.

pdfnup in.pdf --nup 3x3

This should produce the file in-nup.pdf, with the pages of in.pdf arranged in a 3x3 matrix on each output page.

You should first merge all of your PDF files into a single file. You may also want to specify a paper size for the output file; see the pdfjam documentation for details.
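
One possible way to do the merge and the n-up layout in a single call is sketched below. It assumes the pdfjam command itself is installed (pdfnup is a wrapper around it), that it accepts several input files, and that a4paper is the paper size you want. The function name nup_merge is made up for illustration.

```shell
# Hypothetical helper: arrange the pages of several PDFs in a 3x3 grid.
nup_merge() {
    out=$1      # output PDF
    shift       # the remaining arguments are the input PDFs, in order
    if [ "$#" -eq 0 ]; then
        echo "usage: nup_merge OUT.pdf IN.pdf..." >&2
        return 2
    fi
    # --nup 3x3   : nine source pages per output page
    # --paper     : paper size of the output (a4paper is an assumption)
    # --outfile   : explicit output name instead of the default suffix
    pdfjam --nup 3x3 --paper a4paper --outfile "$out" "$@"
}

# Example call (commented out so the snippet only defines the helper):
# nup_merge matrix.pdf a.pdf b.pdf c.pdf
```

Alternatively, the files can be joined first with pdftk, as in the question above, and the merged file passed to pdfnup.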
