Avoid two-character umlauts in PDFs

internationalization pdf preview safari

Introductory explanation

An umlaut is a German vowel, represented in writing as a letter with two dots (diaeresis) over the basic vowel. Examples of umlauts are ä, ö, and ü.

These three letters can be represented in text either as one single character – for example, ü as Unicode U+00FC – or as two characters: the basic vowel (e.g. u, U+0075) and the combining diaeresis ( ¨, U+0308).
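To make the difference between the two representations visible, here is a minimal Python sketch using only the standard library. The variable names and the `describe` helper are made up for illustration.

```python
import unicodedata

composed = "\u00fc"       # ü stored as one precomposed character
decomposed = "u\u0308"    # u followed by the combining diaeresis

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Both strings render as "ü", but they differ in length and content.
print(len(composed), describe(composed))
print(len(decomposed), describe(decomposed))
print(composed == decomposed)  # False: a plain comparison sees different code points
```

Both strings display as the same glyph, yet a plain string comparison treats them as different.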

Both the single-character umlaut and the two-character umlaut look the same in a PDF document, but their underlying encoding is different. This animation shows text copied from the same (!) PDF file, opened in Firefox (top) and in Preview (bottom), pasted into a plain text editor (BBEdit), with individual letters then deleted one by one:

[Animation: text copied from Firefox (top) and Preview (bottom), with letters deleted one by one in BBEdit]

When the umlaut is represented as one character and you search a text for a German word with an umlaut, e.g. Tür "door", you will find that word if it is there. If, on the other hand, the umlaut is represented as two characters and you search for Tür, you will not find it:

Die Tür ist offen.  <= you will find "Tür" in this text
Die Tu¨r ist offen. <= you will not find "Tür" in this text
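The same failure can be reproduced with a naive substring search. This is only a sketch of the effect, not of how any particular PDF viewer searches; the variable names are illustrative.

```python
needle = "T\u00fcr"                    # "Tür" typed with a precomposed ü
one_char = "Die T\u00fcr ist offen."   # umlaut stored as one character
two_char = "Die Tu\u0308r ist offen."  # umlaut stored as u + combining diaeresis

print(needle in one_char)  # True
print(needle in two_char)  # False: the code point sequences do not match
```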

Question

In Apple's Preview and Safari, and also in the latest version of Adobe Acrobat Reader DC (18.011.20058), umlauts in PDF documents are represented as two characters (vowel plus diaeresis). When I open the same PDF document in Firefox, Chrome, or the older Adobe Acrobat X Pro (10.1.16), they are represented as a single character.

Why is that so, and how can I avoid two-character umlauts when I create PDF documents?

Best Answer

Whether you wind up with one or two characters depends on how Unicode normalization is applied by the apps and processes you are using.
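For reference, Unicode defines a composed normalization form (NFC) and a decomposed one (NFD), and Python's standard `unicodedata.normalize` converts between them. The sketch below shows the conversion on plain text; the helper names are made up, and whether normalizing your source text actually controls what a given PDF viewer hands back on copy is an assumption I have not verified.

```python
import unicodedata

def to_composed(text):
    """Convert text to NFC: base letter + combining mark becomes one character where possible."""
    return unicodedata.normalize("NFC", text)

def to_decomposed(text):
    """Convert text to NFD: precomposed characters are split into base letter + combining mark."""
    return unicodedata.normalize("NFD", text)

copied = "Die Tu\u0308r ist offen."   # two-character umlaut, e.g. pasted from a viewer
fixed = to_composed(copied)

print(len(copied), len(fixed))        # the NFC version is one character shorter
print(to_decomposed(fixed) == copied) # True: the conversion round-trips here
```

Running text through `to_composed` after copying it out of a PDF is one way to get single-character umlauts regardless of which viewer produced the text.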

I don't know if there is any way to guarantee one or the other except perhaps via a utility like UnicodeChecker.

Since the two forms are canonically equivalent in Unicode, a competent search system should find either of them.
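As a rough illustration of what such a search does, the sketch below normalizes both the text and the search term to the same form before comparing. The `contains` function is hypothetical, not taken from any particular search tool.

```python
import unicodedata

def contains(haystack, needle, form="NFC"):
    """Return True if needle occurs in haystack, ignoring composed/decomposed differences."""
    norm_haystack = unicodedata.normalize(form, haystack)
    norm_needle = unicodedata.normalize(form, needle)
    return norm_needle in norm_haystack

two_char = "Die Tu\u0308r ist offen."   # umlaut as u + combining diaeresis
print(contains(two_char, "T\u00fcr"))   # True: found despite the different encoding
```

Either form would work as the common target, as long as both sides are normalized the same way.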