How to find out why text is not searchable in a PDF (and make it searchable)

pdf search

I have a PDF article (not created by me).
However, I cannot search for text in the PDF. All PDF viewers I've tried return zero results for words that are obviously in there. I've tried Adobe Acrobat Professional 8, SumatraPDF and Google Chrome.

How can I find out why the document is not searchable?

Things I've checked:

  • The PDF producer is reported as 'pdftopdf' and the PDF version is reported as 1.3. However, it seems to have been created in something like MS Word or OpenOffice (but not *TeX).
  • It is definitely not a scanned document, as the font is crisp and clear at all zoom levels, and the text is selectable.
  • If I look at the security settings (Ctrl+D in Adobe Acrobat), everything is allowed (printing, copying, …).
  • My search options do not have 'match case' turned on.
  • I cannot turn it into a searchable document using Acrobat's 'Recognize text using OCR', as it reports: 'This page contains renderable text'.

So, what else could be the reason for the PDF not being searchable?
And how can I make it text-searchable?

Best Answer

  • It may have a custom font encoding that assigns code points to characters in a way that is incompatible with established encodings such as ASCII or UTF-8/Unicode.

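  • It may render characters individually and out of sequence, so the text a viewer extracts does not match the reading order.

  • It may have had characters flattened to paths, in which case there is no text left to search (although then the text would not be selectable either).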
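
One rough way to check for the first cause is to look at what the file itself declares about its fonts, and at what you get when you copy text out of a viewer. The following is only a minimal sketch using the Python standard library: the function names `font_hints` and `suspicious_ratio` are made up for illustration, and the byte scan assumes the font dictionaries are stored uncompressed (typical for a PDF 1.3 file, but not guaranteed for newer ones). Fonts that have a custom `/Differences` encoding and no `/ToUnicode` map, or copied text full of private-use or replacement characters, would point towards an encoding problem.

```python
import re
import unicodedata

# Matches "/Type /Font" but not "/Type /FontDescriptor".
FONT_DICT = re.compile(rb"/Type\s*/Font\b")


def font_hints(pdf_bytes: bytes) -> dict:
    """Count rough indicators of how fonts map glyph codes to characters.

    Heuristic only: it scans the raw bytes, so anything stored inside a
    compressed object stream will be missed.
    """
    return {
        "fonts": len(FONT_DICT.findall(pdf_bytes)),
        "to_unicode": len(re.findall(rb"/ToUnicode\b", pdf_bytes)),
        "differences": len(re.findall(rb"/Differences\b", pdf_bytes)),
    }


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like encoding debris.

    Counts private-use (Co), unassigned (Cn) and control (Cc) characters,
    plus U+FFFD, the Unicode replacement character.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if unicodedata.category(c) in {"Co", "Cn", "Cc"} or c == "\ufffd"
    )
    return bad / len(chars)


if __name__ == "__main__":
    import sys

    # Usage: python check_pdf.py article.pdf
    with open(sys.argv[1], "rb") as fh:
        print(font_hints(fh.read()))
```

To use the second function, select a paragraph in the viewer, paste it into a text file, and pass the contents to `suspicious_ratio`. A value close to 0 for text that searches fine and a high value for the problem document would support the custom-encoding explanation.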

See the Stack Overflow questions "How do you debug PDF files?" and the now-deleted "PDF Font encoding - why can't I copy text from a PDF?".

To make it text searchable, the best way may be to go back to the original source (e.g. a Word document) and use a different process to produce the PDF. Alternatively you could try rendering your current PDF as a bitmap and then using OCR, but this will be tedious and produce poor results.
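
If going back to the source is not possible, the render-then-OCR route can be scripted rather than done by hand. The sketch below assumes the third-party command-line tool `ocrmypdf` (not mentioned in the question) is installed and on the PATH. Its `--force-ocr` option rasterizes each page before running OCR, which sidesteps the 'This page contains renderable text' refusal. The helper names `build_ocr_command` and `force_ocr` are hypothetical.

```python
import shutil
import subprocess


def build_ocr_command(src: str, dst: str) -> list[str]:
    """Build the ocrmypdf command line that rasterizes pages and OCRs them."""
    return ["ocrmypdf", "--force-ocr", src, dst]


def force_ocr(src: str, dst: str) -> None:
    """Run ocrmypdf on src and write a searchable copy to dst."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_ocr_command(src, dst), check=True)


if __name__ == "__main__":
    import sys

    # Usage: python ocr_pdf.py input.pdf output.pdf
    force_ocr(sys.argv[1], sys.argv[2])
```

Because the pages are rasterized, the output will usually be larger and less crisp than the original, and the recognized text is only as good as the OCR, so the caveat about poor results still applies.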
