Unsearchable, uncopiable PDF document

I have a PDF document that, for some reason, I can't search or copy text from. The PDF is text-based, not a scanned image. When I copy text and paste it into Microsoft Word or GNU Emacs, I get a lot of small boxes in place of the letters. When I search the text in Adobe Reader, it can't find words that I can see are there. The document doesn't seem to have any special protections applied to it. I've had PDFs like this once or twice before. I tried opening it in Google Docs, but again, although it comes out as clear text, I cannot search it. Does this ring any bells with anyone?

I looked at the fonts used in the PDF, and the list looks like this:

--font-65795-6-- (Embedded Subset)
Type: TrueType
Encoding: Built-in
Century (Embedded Subset)
Type: TrueType
Encoding: Built-in

followed by similar lines for Century, Helvetica, Symbol, Times-Roman, and Verdana.

Best Answer

This PDF probably contains its own embedded fonts. In that case the PDF will still display correctly, but the information needed to turn the glyphs back into text is not always available, and copying becomes impossible.

The fonts are in fact all embedded, but in a way that strips out all encoding information. This happens when the process that created the PDF throws away the information about what the text means: the file is still syntactically fully compliant with the PDF spec, but the meaning of its text is gone. Recovering the encoding information is very difficult, and sometimes the best option is to convert the pages to TIFF and then run OCR ...
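As a rough way to decide whether a given file is in this state, you could extract the text with whatever tool you have and check how much of it looks like unmapped glyphs. The sketch below is only a heuristic and makes some assumptions: the function names and the 0.3 threshold are made up for illustration, and it assumes that missing encoding information shows up as private-use, unassigned, control, or replacement characters (the "small boxes"). It will not catch text that comes out as ordinary but wrong letters.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look unmapped."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = 0
    for ch in visible:
        category = unicodedata.category(ch)
        # Co = private use, Cn = unassigned, Cc = control character.
        # U+FFFD is the Unicode replacement character.
        if category in ("Co", "Cn", "Cc") or ch == "\ufffd":
            bad += 1
    return bad / len(visible)


def needs_ocr(text: str, threshold: float = 0.3) -> bool:
    """Guess whether extracted text is garbled enough to fall back to OCR.

    The threshold is an arbitrary illustrative value, not a tested one.
    """
    return suspicious_ratio(text) >= threshold
```

If `needs_ocr` returns true for the text copied out of a page, that is a hint that the route of rendering the pages to images and running OCR is more likely to work than trying to repair the text layer.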

You can try a PDF to Word converter, such as AnyBizSoft or an online converter. After conversion, you can take whatever you need from the Word or text file. Here is a step-by-step tutorial for AnyBizSoft. (AnyBizSoft is recommended by many, but I have never used it personally.)

See also Best Free PDF Tools for more tools and converters.
