Linux – How to rasterize all of the text in a PDF

linuxocrpdfpdftk

You know when you have a pdf, which is a scan of a document and it's a really huge file, because it just stores the picture of the scanned document?

And there are OCR tools which can help you to make a proper document which just stores the text?

Well, I need the reverse of that! Let's say I have a perfect pdf document generated with pdflatex and I need to turn it into such a "huge" pdf, which looks exactly the same when printed on paper (with a certain dpi value), but is just a picture of the original.

My initial idea is to turn the pdf into a series of JPGs and then back into a PDF, but perhaps there is some canonical way for that?

In case you wonder why I would want to do such a thing: I'm currently stuck with a network printer, which is not maintained by me, and which randomly drops characters in printed files! So until someone figures out what's wrong there, I want this as workaround.

Best Answer

You could test out if image based PDF's are polluted as well. First convert PDF to (multipage) TIFF, e.g. with ghostscript:

gs -sDEVICE=tiffg4 -o sample.tif sample.pdf

Then convert the TIFF to PDF, e.g.:

tiff2pdf -z -f -F -pA4 -o sample-img.pdf sample.tif

This result in a PDF file where the pages are images instead of text.

Alternatively, if your system supports printing of TIFF files try to print it directly.

There is also the option of pdf2ps for converting PDF to PS, which if works, would likely be preferable.

But LibreOffice doesn't work for me?

If you're encountering an issue with Draw not being able to do this (I was using version of LibreOffice):

Version: 4.1.4.2
Build ID: 4.1.4.2-4.fc19

**NOTE:* You might be missing this package which is part of LibreOffice:

$ rpm -aq|grep "libre.*pdf"
libreoffice-pdfimport-4.1.4.2-4.fc19.x86_64

This is what the package looks like on Red Hat based distros such as Fedora. I would assume that on Debian/Ubuntu there is a similarly named package, probably libreoffice-pdfimport.

Alternatives?

You could try Okular.

Okular allows you to review and annotate your documents. Annotations created in Okular are automatically saved in the internal local data folder for each user. Okular does not implicitly change any document it opens.

screenshot

ss of okular

What else?

As @Terdon's answer shows, you can also use GIMP, along with a whole host of other tools. @Terdon also was kind enough to post this link in our chatroom which has a list of other tools for annotating PDFs as well as viewing them.

Find duplicate PDF files by content

Since different publishers use different methods of "marking" the PDFs you need to make sure you compare without taking the markings into account.

You also need an efficient method to compare a new PDF to all already downloaded PDFs in case you repeatedly download the same PDF and it is e.g. marked with the IP and/or date-time-stamp as you suggest. You don't want to use a time consuming comparison mechanism that compares each new PDF with many already downloaded PDFs

What you need is a utility that strips each of the possible markings and generate a hash of the remaining data. You will need to keep a hash → file name map, which can be in a simple file, and if a computed hash is already in the file you have a duplicate (and delete it or do whatever needed) and if the hash in not yet there, you add the hash and file name. The file would look something like:

6fcb6969835d2db7742e81267437c432  /home/anthon/Downloads/explanation.pdf
fa24fed8ca824976673a51803934d6b9  /home/anthon/orders/your_order_20150320.pdf

That file is negligently small compared to the original PDFs. If you have millions of PDFs you might consider storing this data in a database. For efficiency sake you might want to include the filesize and number of pages in there (pdfinfo | egrep -E '^Pages:' | grep -Eo '[0-9]*').

The above pushes the problem to removing the markings and generating the hash. If you know where the PDF comes from when invoking the hash generating routine (i.e. if you do the downloads programmatically), you can fine-tune the hash generation based on that. But even without that there are several possibilities for hash generation:

if the metadata for title and author is non-empty and does not including non-specific strings like "Acrobat" or "PDF" you could generate the hash based on just the author and title information. Use pdfinfo -E file.pdf | grep -E '^(Author:)|(Title:) | md5sum to get the hash. You can include the number of pages in calculating the hash as well ('Pages:' in the pdfinfo output).
if the previous rule doesn't work and the PDF contains images, extract the images and generate a hash on the combined image data. If the images ever contain text in the footer or header like "Licensed to Joe User", strip an X number of lines form the top or bottom, before calculating the hash. If that markings is in some big lettered grayed background text this will of course not work, unless you filter out pixels that are not totally black (for that you could use imagemagick). You can use pdfimages to extract the image information into a temporary file.
if the previous rules don't work (because there are no images) you can use pdftext to extract the text, filter out the marking (if you filter out a little to much, that is not a problem) and then generate the hash based on that.

Additionally you can compare if the file size of the old file found via the hash and see if is within certain margins with the new file. Compression and ifferences in strings (IP/date-time-stamp) should only result in less than one percent difference.

If you know the method the publisher uses when determining the hash, you can directly apply the "right" method of the above, but even without that you can check for the metadata and apply some heuristics, or determine the number of images in a file and compare that with the number of pages (if they are close you probably have a document consisting of scans). pdftext on scanned image PDFs also has a recognisable output.

As a basis to work from I created a python package that is on bitbucket and/or can be installed from PyPI using pip install ruamel.pdfdouble. This provides you with the pdfdbl command that does the scanning as described above on metadata, extracted images or on text. It doesn't do any filtering of markings (yet), but the readme describes which (two) methods to enhance to do add that.

The included readme:

ruamel.pdfdouble

this package provides the pdfdbl command:

pdfdbl scan dir1 dir2

This will walk down the directories provided as argument and for the PDF files found, create a hash based on (in order):

metadata if unique
images if the number of images
text

This assumes that pdfinfo, pdfimages and pdftotext` from the poppler-utils package are avaialable.

A "database" is build up in ~/.config/pdfdbl/pdf.lst against which further scans are tested.

Removing markings

In ruamel/pdfdouble/pdfdouble.py there are two methods that can be enhanced to filter out markings in the PDF that make them less unique and make virtually the same files to have different hashes.

For text the method PdfData.filter_for_marking should be extended to remove and markings from the string that is its arguments and return the result.

For scanned images the method PdfData.process_image_and_update needs to be enhanced, e.g. by cutting off the images bottom and top X lines, and by removing any gray background text by setting all black pixels to white. This function needs to update the hash passed in using the .update() method passing in the filtered data.

Restrictions

The current "database" cannot handle paths that contain newlines

This utility is currently Python 2.7 only.

IP conforming stringparts can be substituted with Python's re module:

import re
IPre = re.compile("(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
              "([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])")

x = IPre.sub(' ', 'abcd 132.234.0.2 ghi')
assert x == 'abcd   ghi'