Extracting background images from a PDF file

Tags: extract, pdf, pdf-reader, xpdf

I have a PDF file containing maps of the building I work in, here:

http://www.libsys.und.edu/dev/FloorPlans_All.pdf

The original source files have been lost, and I've been asked to extract the map images, preferably without the text and icons that have been overlaid on top of them. This has proven annoyingly difficult.

So far, I have tried the following GUI programs:

  • Adobe Reader: lets me select text, but not the background images
  • FoxIt PDF Viewer: lets me select text, but not the background images
  • XPDF on Ubuntu 10.10: lets me select text, but not the background images

And also the following command-line programs:

  • pdfimages: extracts the icons indicating bathrooms just fine, but not the background images
  • pdftohtml: same as pdfimages, plus it makes a poorly marked up HTML document
  • pdfextract: same as pdfimages
  • convert: successfully saved images, but with the text burned into them

I've even tried opening the PDF manually in a text editor and extracting the stream objects by pasting them into a new file and saving it with a .jpg, .png, or .bmp extension (each in turn). Considering how little I know about the internal structure of PDF files, it's no surprise that this didn't work.
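For anyone attempting the same manual approach: a PDF stream is only usable as an image file as-is when its dictionary names the /DCTDecode filter, in which case the stream bytes are a JPEG. Other streams are usually compressed with some other filter (for example /FlateDecode) or hold page drawing instructions rather than pixels, so saving them with a .png or .bmp extension will not produce a valid image. Below is a minimal Python sketch of that scan. The function name extract_jpeg_streams is made up for illustration, and the regular expressions are a simplification that assumes an uncompressed, well-formed object layout; this is not a real PDF parser. If the maps are stored as vector drawings rather than raster images, this will not find them.

```python
import re

# One indirect object: "<num> <gen> obj ... endobj".
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
# Inside an object: everything before "stream" (the dictionary), then the stream data.
STREAM_RE = re.compile(rb"(.*?)stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def extract_jpeg_streams(pdf_bytes):
    """Return the raw bytes of every stream whose dictionary names /DCTDecode.

    Simplified sketch: it relies on pattern matching over the raw file, so it
    can miss images stored inside compressed object streams.
    """
    jpegs = []
    for obj in OBJ_RE.finditer(pdf_bytes):
        stream = STREAM_RE.search(obj.group(1))
        if stream is None:
            continue  # object has no stream (e.g. a plain dictionary)
        header, data = stream.group(1), stream.group(2)
        # Keep only JPEG-encoded streams; check the JPEG start marker as well.
        if b"/DCTDecode" in header and data.startswith(b"\xff\xd8"):
            jpegs.append(data)
    return jpegs
```

Each returned item can be written to disk in binary mode with a .jpg extension, for example open("map-0.jpg", "wb").write(jpegs[0]).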

So … is there any way I can retrieve the map images from this thing without also getting the text and icons?

Best Answer

You can download the Xpdf tools for Linux and Windows from http://www.foolabs.com/xpdf/download.html. Then run pdfimages -j input.pdf output, which should write the embedded images as output-000.jpg, output-001.jpg, and so on (images that are not stored as JPEG inside the PDF are written as PPM or PBM files instead). See http://linuxcommand.org/man_pages/pdfimages1.html for more usage options.
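If you need to run this on several files or page ranges, the command can be wrapped in a short script. The following is only a sketch in Python, assuming pdfimages is installed and on your PATH. The helper names build_pdfimages_command and extract_images are hypothetical; the only flags used are -j, -f (first page), and -l (last page) from the pdfimages man page.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def build_pdfimages_command(pdf_path, prefix, first_page=None, last_page=None):
    """Return the argument list for a pdfimages run (nothing is executed here)."""
    cmd = ["pdfimages", "-j"]  # -j: save DCT (JPEG) images as .jpg files
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [str(pdf_path), str(prefix)]
    return cmd


def extract_images(pdf_path, prefix, **page_range):
    """Run pdfimages and return the files it wrote, sorted by name."""
    if shutil.which("pdfimages") is None:
        raise FileNotFoundError("pdfimages not found on PATH")
    subprocess.run(build_pdfimages_command(pdf_path, prefix, **page_range), check=True)
    prefix = Path(prefix)
    # pdfimages names its output <prefix>-NNN.<ext>
    return sorted(prefix.parent.glob(prefix.name + "-*"))


# Command-line use: python extract.py input.pdf output
if __name__ == "__main__" and len(sys.argv) == 3:
    for path in extract_images(sys.argv[1], sys.argv[2]):
        print(path)
```

Keeping the command construction separate from the subprocess call makes it easy to print the command and check it before anything is run.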
