Character Encoding – Strange Character in a File

character encoding, unicode

I have a UTF-8 file that contains a strange character, visible to me just as

<96>

[Screenshots: how the character appears in vi, in gedit, and in LibreOffice]

The character makes a series of basic Unix tools misbehave, including:

cat file makes the character disappear, and more as well
I cannot copy and paste it inside vi/vim; a search for it will not even find the character itself
grep fails to display anything as well, as if the character did not exist.

The program file works fine and recognizes it as a UTF-8 file. I also know that, because of the nature of the file, it most likely came from a copy and paste from the web, and the character initially represented an em dash.

My basic questions are:

Is there anything wrong with this file?
How can I search for other occurrences of it inside the same file?
How can I grep for other files that may contain the same problem/character?

The file can be found here: file.txt

Best Answer

This file contains the bytes C2 96, which are the UTF-8 encoding of code point U+0096. That code point is one of the C1 control characters, commonly called SPA: "Start of Protected Area" (or "Start of Guarded Area"). It isn't a useful character on any modern system, but its presence is unlikely to be harmful.
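One way to confirm which bytes are actually present is to dump them in hex. This is a minimal sketch, assuming a POSIX shell with od available; the sample text is invented for illustration and stands in for the real file.

```shell
# Build a sample containing the bytes C2 96 (octal 302 226) and dump it in hex.
# The surrounding words are made up; substitute "od -An -tx1 < yourfile" in practice.
sample=$(printf 'before \302\226 after' | od -An -tx1 | tr -s ' \n' ' ')
echo "$sample"
```

The pair c2 96 should show up in the dump between the two 20 (space) bytes.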

The original source for this was likely a byte 0x96 in some single-byte 8-bit encoding that was transcoded incorrectly somewhere along the way. Most probably it was a Windows CP1252 en dash "–", which has byte value 0x96 in that encoding (most other plausible candidates have the C1 control set at positions 0x80-0x9F). That byte was then converted to UTF-8 as though it were Latin-1 (ISO/IEC 8859-1), which is not an uncommon mistake. The result is that the byte is interpreted as the control character and encoded accordingly, as you've seen.
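The two interpretations of that single byte can be compared directly. This is a sketch, assuming an iconv that knows the names CP1252 and ISO-8859-1 (glibc's does); the variable names are only for illustration.

```shell
# A lone byte 0x96 (octal 226), as a CP1252 source would store an en dash.
# Decoded as CP1252 it is U+2013, whose UTF-8 encoding is E2 80 93.
right=$(printf '\226' | iconv -f CP1252 -t UTF-8 | od -An -tx1 | tr -d ' \n')
# Decoded as ISO-8859-1 it is the control character U+0096, encoded as C2 96.
wrong=$(printf '\226' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "as CP1252:     $right"
echo "as ISO-8859-1: $wrong"
```

The second result matches the bytes found in the file, which supports the Latin-1 mistranslation theory.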

You can fix this file with the iconv tool, which is part of glibc.

iconv -f utf-8 -t iso-8859-1 < mwe.txt | iconv -f cp1252 -t utf-8

produces a correct version of your minimal example for me. That works by first converting the UTF-8 to latin-1 (inverting the earlier mistranslation), and then reinterpreting that as cp1252 to convert it back to UTF-8 correctly.

It does depend on what else is in the real file, however. If you have characters outside Latin-1 elsewhere it will fail because it can't encode those correctly at the first step.
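Before relying on the iconv pipeline, it may be worth checking whether the first step can succeed at all. A rough sketch, assuming glibc's iconv, which exits with a non-zero status when it meets a character it cannot convert (other implementations may substitute a placeholder instead); the sample line and the variable name are made up.

```shell
# U+20AC EURO SIGN (UTF-8: E2 82 AC) has no ISO-8859-1 equivalent,
# so the UTF-8 -> ISO-8859-1 step is expected to fail on this input.
if printf 'price: \342\202\254 5\n' | iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1; then
    latin1_ok=yes
else
    latin1_ok=no
fi
echo "convertible to ISO-8859-1: $latin1_ok"
```

If the check reports "no" for the real file, the byte-level replacement is probably the safer route.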

If you don't have iconv, or it doesn't work for the real file, you can replace the bytes directly using sed:

LC_ALL=C sed -e $'s/\xc2\x96/\xe2\x80\x93/g' < mwe.txt

This replaces C2 96 with the UTF-8 en dash encoding E2 80 93. You could also replace it with e.g. a hyphen or two by changing \xe2\x80\x93 into --.
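The $'...' quoting is a feature of bash, ksh and zsh. In a plain POSIX sh, roughly the same replacement can be sketched by building the byte sequences with printf octal escapes; the variable names and the sample text here are invented for illustration.

```shell
# Build the byte sequences with printf octal escapes (portable to plain sh).
bad=$(printf '\302\226')        # C2 96: UTF-8 encoding of U+0096
good=$(printf '\342\200\223')   # E2 80 93: UTF-8 encoding of U+2013 EN DASH
# Replace on a sample line, then dump the result in hex to check the bytes.
fixed=$(printf 'pages 10 \302\226 12\n' |
    LC_ALL=C sed -e "s/$bad/$good/g" |
    od -An -tx1 | tr -d ' \n')
echo "$fixed"
```

The hex dump should contain e28093 where c296 used to be.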

You can grep in a similar fashion. We're using LC_ALL=C to make sure we're reading the actual bytes, and not having grep interpret things:

LC_ALL=C grep -R $'\xc2\x96' .

will list every place under this directory where those bytes appear. You may want to limit it to just text files if you have mixed content around, since binary files will contain any given pair of bytes fairly often.
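For the single-file case and for skipping binaries, something along these lines may help. It is a sketch that assumes GNU grep (-I, which ignores binary files, is not in POSIX) and uses a throwaway directory with invented file names.

```shell
pat=$(printf '\302\226')   # the byte pair C2 96
dir=$(mktemp -d)           # scratch directory holding made-up sample files
printf 'clean line\nrange 1 \302\226 2\n' > "$dir/a.txt"
printf 'nothing odd here\n' > "$dir/b.txt"

# Occurrences inside one file: -n prefixes each match with its line number.
hits=$(LC_ALL=C grep -n "$pat" "$dir/a.txt" | cut -d: -f1)
echo "lines containing the sequence in a.txt: $hits"

# Across a tree: -l prints only file names, -I skips files grep considers binary.
files=$(LC_ALL=C grep -RIl "$pat" "$dir")
echo "$files"
rm -rf "$dir"
```

Only a.txt should be reported, with the match on line 2.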

Related Solutions

Character Encoding – How to Re-encode a Mixed Encoded Text File

What you have is in fact ASCII (in its usual encoding in 8-bit bytes) with a bit of UCS-2 (Unicode restricted to the Basic Multilingual Plane (BMP), where each character is encoded as two 8-bit bytes), or perhaps UTF-16 (an extension of UCS-2 that can encode all of Unicode by using a multi-word encoding for code points above U+FFFF).

I doubt you'll find a tool that can handle such an unholy mixture out of the box. There is no way to decode the file in full generality. In your case, what probably happened is that some ASCII data was encoded into UTF-16 at some point (Windows and Java are fond of UTF-16; it is practically unheard of in the unix world). If you go by the assumption that the original data was all ASCII, you can recover a usable file by stripping off all null bytes.

<bizarre tr -d '\000' >ascii
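To see why stripping null bytes works under that assumption, here is a small sketch with an invented sample: ASCII text with the two letters "hi" embedded as UTF-16LE (68 00 69 00).

```shell
# Plain ASCII with 'h' and 'i' stored as UTF-16LE code units (each followed by 00).
# Deleting every NUL byte leaves only the ASCII half of each unit.
mixed_clean=$(printf 'ascii \150\000\151\000 tail' | tr -d '\000')
echo "$mixed_clean"
```

This only recovers the text when every UTF-16 character in the file was originally ASCII; anything else would be mangled.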

How to do a regex search in a UTF-16LE file while in a UTF-8 locale

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
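A sketch of the full pipeline including the line ending conversion, assuming the file uses CRLF line endings; the temporary file and its contents are made up for the example.

```shell
f=$(mktemp)
# Create a small UTF-16LE file with CRLF line endings to stand in for myfile.txt.
printf 'foo\r\nbar\r\n' | iconv -f UTF-8 -t UTF-16LE > "$f"

# Convert to UTF-8, drop the carriage returns, then use grep as usual.
# Without the tr step, the trailing \r would stop '$' from matching here.
match=$(iconv -f UTF-16LE -t UTF-8 "$f" | tr -d '\r' | grep '^ba.$')
echo "$match"
rm -f "$f"
```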
