Copy/Paste from documents (PDF, docx) – weird behaviour of diacritics

Tags: copy/paste, font, ms office, pdf, preview

When I copy text from PDF (Preview) or Docx (Pages) with Czech characters, some of the Czech characters get copied with their accent "stuck" to them.

To make this even weirder, the behaviour is pretty inconsistent: sometimes "ř" is copied correctly, sometimes it isn't. It also doesn't seem to be tied to a specific font, although I think it happens more often with fonts that are not native to OS X (such as Cambria, which happens to be MS Office default).

Screenshot from WordPress WYSIWYG textarea

Result of CMD+C for "í": "í́"

Why is this happening?

Edit

  • OSX: 10.13.6 (although it happened to me on older OS and even other machines)
  • apps I've copied the text from: Preview (PDF), Pages (doc, docx)
  • apps I've pasted the text into: anything (from Sublime Text to the text editor on Stack Exchange, see above)

Also, I've noticed that this happens often at the ends of words (possibly ends of lines). I will confirm this once it happens again, as the behaviour is frustratingly hard to reproduce.

Best Answer

What you are dealing with is one of the many symptoms of what I consider the bane of every modern programmer's existence: Unicode Normalization and interchanging character encodings.

One could literally write a 1000 page book just on the history of this chaos (and I wouldn't be surprised if someone has already), so I'll boil it down to the basics of what you're encountering here (and I'll be oversimplifying a bit), but then I'll give you some links for "further reading".

First, let's make sure you've got your Input Menu in your menu bar: in System Preferences, open the Keyboard preference pane and tick "Show Input menu in menu bar" under "Input Sources". Then, from that menu item, open what is now called "Show Emoji and Symbols". At the top left of the window, select "Customize List", go to "Code Tables" and check "Unicode" and "ISO-8859-1". We'll do a brief lecture and then a demo.

So again, there are two interrelated but separate issues here:

1. Character Encodings

I consider this the root cause of this particular issue. The problem is that Microsoft has for years been notorious for not handling Unicode well, because its platforms have more or less stuck with an older implementation of multilingual character sets, known variously as "wide characters", UCS-2, or UTF-16. This system was implemented years ago, at a time when it was thought that 16 bits (enough to represent 65,536 characters) would be sufficient to encode every symbol we'd ever need. Today, the Unicode code space has room for 1,114,112 code points, far more than a single 16-bit value can address.

So today, most systems (and everything Apple) use an encoding called UTF-8, a variable-width character encoding in which a character takes anywhere from one to four bytes rather than a fixed number of bits. This keeps it backwards compatible with ASCII, and it can also accommodate as many new symbols and characters as we like.
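To make "variable-width" concrete, here is a minimal Python sketch that counts how much space a few characters take in UTF-8 and in UTF-16. The helper name `widths` and the sample characters are my own choices for illustration, not anything from the tools discussed here.

```python
def widths(ch):
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # each code unit is 2 bytes
    return utf8_bytes, utf16_units

# An ASCII letter, i-acute, the euro sign, and an emoji outside the 16-bit range.
for ch in ["a", "\u00ed", "\u20ac", "\U0001F600"]:
    utf8_bytes, utf16_units = widths(ch)
    print(f"U+{ord(ch):04X}: {utf8_bytes} UTF-8 byte(s), {utf16_units} UTF-16 code unit(s)")
```

The last sample needs two 16-bit units in UTF-16, which is exactly the case the original 16-bit design did not anticipate.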

So when copying text in and out of programs that use a different encoding (like Microsoft's), the text needs to be completely re-encoded. This conversion is traditionally handled by tools such as iconv, though there are literally dozens of implementations of how this is done.
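As a rough sketch of what such a conversion step does, the Python below decodes bytes from one encoding and re-encodes them in another. The function name `reencode` is hypothetical, and this only illustrates the idea; it is not how Preview, Pages, or Office actually implement copy and paste.

```python
def reencode(data, source, target):
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source).encode(target)

# "ří" as UTF-16 (little-endian) bytes, converted to UTF-8.
utf16_bytes = "\u0159\u00ed".encode("utf-16-le")
utf8_bytes = reencode(utf16_bytes, "utf-16-le", "utf-8")
print(utf16_bytes.hex(), "->", utf8_bytes.hex())

# Decoding with the wrong source encoding does not fail; it silently garbles.
garbled = "\u00ed".encode("utf-8").decode("iso-8859-1")
print(repr(garbled))
```

When both sides agree on the encodings, the conversion is lossless; the trouble starts when one side guesses wrong or applies an extra transformation along the way.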

2. Unicode Combining Characters

Compounding the issue of encodings is the fact that the Unicode standard has evolved over the years. At some point its designers realized that, in order to keep the number of unique characters limited to "only" millions rather than billions, it might be best to have some characters be "combining characters": characters that modify the previous one in a regular way. That way you don't need a separate entry for every letter with every accent variant; you just add a "shared" accent character after the original character. But it wasn't always done that way, so there are multiple ways of producing the same symbol. Yours is the perfect example.


We start with the symbol LATIN SMALL LETTER I (U+0069):

i

Now, when you want to add the acute accent, Microsoft replaces it with

LATIN SMALL LETTER I WITH ACUTE (U+00ED):

í

But Apple, instead, adds a second character, COMBINING ACUTE ACCENT (U+0301):

́

You can do this yourself (here's where the Character Viewer comes in). Just type an i, then search for "combining acute" in the Character Viewer, double-click the symbol, and voilà:

í

This is, in fact, completely different from the precomposed symbol above. It is LATIN SMALL LETTER I (U+0069) followed by COMBINING ACUTE ACCENT (U+0301). Copy and paste each into the Character Viewer, and you'll see what I mean.
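If you'd rather check this in code than in the Character Viewer, the following Python sketch uses the standard `unicodedata` module to list the code points in each form and to convert between them. The helper `describe` is a name I made up for the example.

```python
import unicodedata

precomposed = "\u00ed"   # LATIN SMALL LETTER I WITH ACUTE
decomposed = "i\u0301"   # LATIN SMALL LETTER I + COMBINING ACUTE ACCENT

def describe(text):
    """Return the code point and Unicode name of each character in text."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]

print(describe(precomposed))
print(describe(decomposed))

# The two strings look identical on screen but do not compare equal...
print(precomposed == decomposed)
# ...until both are brought to the same normalization form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

NFC is the composed form (one code point where possible) and NFD is the decomposed form (base letter plus combining marks).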

Yes, both visually represent the same symbol. But what if, somewhere along the line (usually around the same time as the character set conversion), a process adds the combining character while the original precomposed character is retained? That is, what happens when the combining accent is added to the legacy precomposed letter rather than to the bare base letter? Well, the combining accent will still want to do its job.

So, when LATIN SMALL LETTER I WITH ACUTE (U+00ED) is followed by COMBINING ACUTE ACCENT (U+0301), you get:

í́

And there you have it.
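Note that normalization alone does not repair this doubled accent: the sequence U+00ED U+0301 is already in NFC, so it passes through unchanged. If you need to clean up pasted text, one possible approach is sketched below in Python. The helper `collapse_repeated_marks` is hypothetical, and it rests on my assumption that the same combining mark appearing twice in a row on one letter is always an error, which holds for Czech but not necessarily for every script.

```python
import unicodedata

def collapse_repeated_marks(text):
    """Hypothetical helper: drop a combining mark that immediately repeats."""
    out = []
    for c in unicodedata.normalize("NFD", text):
        # combining() is non-zero for combining marks such as U+0301.
        if unicodedata.combining(c) and out and out[-1] == c:
            continue  # same mark twice in a row: keep only the first
        out.append(c)
    # Recompose so the result uses precomposed letters where they exist.
    return unicodedata.normalize("NFC", "".join(out))

broken = "\u00ed\u0301"  # precomposed i-acute followed by an extra acute
print(unicodedata.normalize("NFC", broken) == broken)  # normalization alone changes nothing
fixed = collapse_repeated_marks(broken)
print([f"U+{ord(c):04X}" for c in fixed])
```

As a side effect the whole string comes back in NFC, so correctly decomposed letters elsewhere in the text are recomposed as well.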

There's a very famous Stack Overflow answer that demonstrates how far this can go.


Some light reading: