Cleaning up pdftotext font issues

ascii, conversion, pdf, special characters

I'm using pdftotext to make an ASCII version of a PDF document (made with LaTeX), because collaborators prefer a simple document in MS Word.

The plain text version I see looks good, but upon closer inspection the f character seems to be frequently mis-converted depending on what characters follow. For example, fi and fl often seem to become one special character, which I will try to paste here: ﬁ and ﬂ.

What is the best way to clean up the output of pdftotext? I am thinking sed might be the right tool, but I'm not sure how to detect these special characters.

Best Answer

By default, pdftotext outputs Unicode (UTF-8) data. If your terminal or text editor doesn't support UTF-8, ligatures such as "ﬁ" and "ﬂ" (each represented as a single character in Unicode) will appear garbled, as you have noticed.
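You can confirm that this is what's happening by searching the output for those ligature characters directly (a quick check, assuming your shell and the file are both UTF-8):

grep -En 'ﬁ|ﬂ' output.txt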

The simple fix is to tell pdftotext to output ASCII instead of Unicode:

pdftotext -enc ASCII7 input.pdf output.txt

This should produce clean ASCII output, so you shouldn't need to clean it up manually afterwards.
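If you would rather keep the UTF-8 output (for example, to preserve other non-ASCII characters) and only fix the ligatures afterwards, sed can do the substitution. This is a minimal sketch covering just the fi and fl ligatures mentioned above; depending on the fonts, other ligatures (ff, ffi, ffl) may also need handling:

sed -i 's/ﬁ/fi/g; s/ﬂ/fl/g' output.txt

(The -i in-place flag works as shown with GNU sed; BSD/macOS sed needs -i ''.)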
