Search special characters or short mathematical symbols in pdf files

Tags: character encoding, latex, pdf, pdf-reader, special characters

I have Adobe Reader, Okular and Document Viewer as pdf readers. The papers I read are often texts with mathematical formulae, generated by LaTeX.

But searching for special characters or mathematical symbols in PDF files with these viewers does not work reliably. What I usually do is select the key part (special characters or a mathematical expression) in the file, press Ctrl+C, then Ctrl+F, then Ctrl+V. Quite often, what the viewer highlights is not what I was looking for.

I believe this is an important feature for the viewer, and there is a real need to look for not only words but also special characters in a document.

Could anyone tell me how you work around this? Is there a better PDF reader or a smarter way to search?

Best Answer

There is probably no generic solution to your problem, even though it would be cool if there was.

The core of the problem is that PDF is designed to specify how something should look when printed. Being able to search the PDF for a formula was probably not a major concern. So the problem is not the viewer; the problem is that the PDF doesn't contain the information you are looking for in an accessible way.

When you have, for example, an alpha (α) in a formula, this could be coded

  • as the Unicode character U+03B1
  • as a plain Latin "a" set in a Greek font (the Windows font Symbol comes to mind)
  • or it could just be an appropriate vector graphic which looks like an alpha but without having an ASCII or Unicode character associated with it.

In the first case your solution should probably work, but in the second case the search will stop at every single "a" in the text. In the third case the search will come up with nothing at all, since there is no text to be searched.
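To make the three cases concrete, here is a minimal Python sketch. It does not read a real PDF; the `ENCODINGS` table and both helper functions are made up for illustration, and they only model the text a viewer might recover for the expression "x = α + a" under each encoding.

```python
# Hypothetical model (not a real PDF parser): the character a viewer could
# recover for an alpha glyph under each of the three encodings above.
ENCODINGS = {
    "unicode": "\u03b1",     # a real U+03B1 in the text layer
    "symbol_font": "a",      # a Latin "a" that is merely drawn with a Greek font
    "vector_graphic": None,  # outlines only, no character attached
}


def extracted_text(encoding):
    """Return the searchable text of 'x = <alpha> + a' for one encoding."""
    alpha = ENCODINGS[encoding]
    return "x = " + (alpha if alpha is not None else "") + " + a"


def find_all(haystack, needle):
    """Return the start offset of every occurrence of needle in haystack."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


if __name__ == "__main__":
    for name in ENCODINGS:
        text = extracted_text(name)
        print(name, repr(text),
              "alpha hits:", find_all(text, "\u03b1"),
              "'a' hits:", find_all(text, "a"))
```

In the Symbol-font case, what you copy out of the PDF is an "a", so the search lands on the alpha and on every ordinary "a" as well. In the vector-graphic case there is nothing to copy in the first place.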

The problem gets more difficult when you search for elements with indices, such as $A_B^C$. This needs to be typeset in a certain way (the B below the A, the C above it), but there is no fixed rule in which order the PDF creator should insert the three characters into a text box; it could even decide to create three separate text boxes, or decide that all the upper indices of a formula come first, and the lower indices come last.

So as an example, the formula $A_B^C = D^E_F$ could be represented as

C E A D B F

or

A B C D E F

or

A C B D F E

or any other way the PDF creator pleases, as long as the position information for each letter is correct to produce the right formula. Needless to say, in the first and third cases you will have a hard time searching for $A_B$.
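The effect of the three orderings can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the coordinates in `POSITIONS`, the thresholds in `has_subscript`, and the function names are invented, and real PDF content streams are far messier. The point is only that a plain text search depends on the stream order, while the geometry is the same in all three cases.

```python
# Made-up (x, y) positions for the six letters of A_B^C = D^E_F.
# The "=" is left out, as in the three listings above.
POSITIONS = {
    "A": (0, 0), "B": (1, -1), "C": (1, 1),
    "D": (4, 0), "E": (5, 1), "F": (5, -1),
}

# The three stream orders from the answer; the layout is identical in each.
STREAM_ORDERS = {
    "superscripts_first": "CEADBF",
    "reading_order": "ABCDEF",
    "mixed": "ACBDFE",
}


def glyphs(order):
    """Return (char, x, y) tuples in the order they appear in the stream."""
    return [(ch,) + POSITIONS[ch] for ch in order]


def stream_text(order):
    """Text a naive extractor would produce: characters in stream order."""
    return "".join(ch for ch, _, _ in glyphs(order))


def has_subscript(glyph_list, base, sub):
    """True if some `sub` glyph sits just right of and below a `base` glyph."""
    for ch, x, y in glyph_list:
        if ch != base:
            continue
        for ch2, x2, y2 in glyph_list:
            if ch2 == sub and 0 < x2 - x <= 1 and y2 < y:
                return True
    return False


if __name__ == "__main__":
    for name, order in STREAM_ORDERS.items():
        print(name,
              "| text search for 'AB':", "AB" in stream_text(order),
              "| positional A_B:", has_subscript(glyphs(order), "A", "B"))
```

Only the reading-order stream contains "AB" as a substring. A search that looked at positions instead could find $A_B$ in all three, but as far as I know the usual viewers search the text and not the layout.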

After all this explaining, what can you do?

  • not much
  • try to print the PDF to a TIFF image, then OCR it using a tool that can deal with mathematical symbols
  • lobby for paper authors to publish preprints on arxiv.org along with the LaTeX source, which you can search more easily
  • lobby for Adobe to add a kind of "equation support" in the next version of PDF to address the problem; of course this would then need to be implemented in the tools used to create and modify the PDF
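
If you do get hold of the LaTeX source, searching it is ordinary text search. The following is a rough sketch, not a LaTeX parser: `normalize` and `search_source` are names invented here, and the normalization (drop whitespace, drop braces around a single character or a single control sequence) is a crude heuristic that can produce false positives. It does not handle reordered indices such as `A^C_B` versus `A_B^C`.

```python
import re


def normalize(tex):
    """Crude normalization so that A_{B}^{C} and A_B^C compare equal."""
    tex = re.sub(r"\s+", "", tex)  # ignore all whitespace
    # Drop braces that wrap one character or one control sequence like \alpha.
    tex = re.sub(r"\{(\\[A-Za-z]+|[A-Za-z0-9])\}", r"\1", tex)
    return tex


def search_source(source, query):
    """True if the normalized query occurs in the normalized source."""
    return normalize(query) in normalize(source)


if __name__ == "__main__":
    source = r"We define $A_{B}^{C} = D^{E}_{F}$ and $\alpha + a$."
    for query in (r"A_B^C", r"\alpha", r"A_C"):
        print(query, "->", search_source(source, query))
```

Because the source names the alpha explicitly as `\alpha`, the ambiguity between α and "a" described above does not arise here.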