Cleaning up pdftotext font issues

ascii, conversion, pdf, special characters

I'm using pdftotext to make an ASCII version of a PDF document (made with LaTeX), because collaborators prefer a simple document in MS Word.

The plain text version I see looks good, but upon closer inspection the f character seems to be frequently mis-converted depending on what characters follow. For example, fi and fl often seem to become one special character, which I will try to paste here: ﬁ and ﬂ.

What is the best way to clean up the output of pdftotext? I am thinking sed might be the right tool, but I'm not sure how to detect these special characters.

Best Answer

By default, pdftotext outputs Unicode (UTF-8) data. If your terminal or text editor doesn't support UTF-8, ligatures such as "ﬁ" and "ﬂ" (each represented as a single character in Unicode) will appear garbled, as you have noticed.
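You can confirm that this is what's happening by searching the output for those ligature characters directly (a quick check, assuming your shell and the file are both UTF-8):

grep -En 'ﬁ|ﬂ' output.txt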

The simple fix is to tell pdftotext to output ASCII instead of Unicode:

pdftotext -enc ASCII7 input.pdf output.txt

This should produce clean ASCII output, so you shouldn't need to clean it up manually afterwards.
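If you would rather keep the UTF-8 output (for example, to preserve other non-ASCII characters) and only fix the ligatures afterwards, sed can do the substitution. This is a minimal sketch covering just the fi and fl ligatures mentioned above; depending on the fonts, other ligatures (ff, ffi, ffl) may also need handling:

sed -i 's/ﬁ/fi/g; s/ﬂ/fl/g' output.txt

(The -i in-place flag works as shown with GNU sed; BSD/macOS sed needs -i ''.)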
