How to identify a strange character

Tags: character-encoding, unicode

I am trying to identify a strange character I have found in a file I am working with:

$ cat file
�
$ od file
0000000 005353
0000002
$ od -c file
0000000 353  \n
0000002
$ od -x file
0000000 0aeb
0000002

The file is using ISO-8859 encoding and can't be converted to UTF-8:

$ iconv -f ISO-8859 -t UTF-8 file
iconv: conversion from `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
$ iconv  -t UTF-8 file
iconv: illegal input sequence at position 0
$ file file
file: ISO-8859 text

My main question is: how can I interpret the output of od here? I am trying to use this page, which lets me translate between different character representations, but it tells me that 005353 as a "Hex code point" is 卓, which doesn't seem right, and that 0aeb as a "Hex code point" is ૫, which, again, seems wrong.

So, how can I use any of the three values (353, 005353 or 0aeb) to find out what character they are supposed to represent?

And yes, I did try with Unicode tools, but it doesn't seem to be a valid UTF-8 character either:

$ uniprops $(cat file)
U+FFFD ‹�› \N{REPLACEMENT CHARACTER}
    \pS \p{So}
    All Any Assigned Common Zyyy So S Gr_Base Grapheme_Base Graph X_POSIX_Graph
       GrBase Other_Symbol Print X_POSIX_Print Symbol Specials Unicode

If I understand the description of the Unicode U+FFFD character correctly, it isn't a real character at all but a placeholder for a corrupted character. That makes sense, since the file isn't actually UTF-8 encoded.

Best Answer

Your file contains two bytes, EB and 0A in hex. It’s likely that the file is using a character set with one byte per character, such as ISO-8859-1; in that character set, EB is ë:

$ printf "\353\n" | iconv -f ISO-8859-1
ë
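
The \353 in that printf is the same octal value that od -c printed: octal 353 is hex EB (decimal 235). printf can convert between the two notations, since it treats a numeric argument with a leading 0 as octal:

$ printf '%02x\n' 0353 012
eb
0a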

Other candidates would be δ in code page 437, Ù in code page 850...
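
You can check those with iconv too, decoding the same byte under each assumption (the encoding names below are the ones GNU iconv uses; check iconv -l on your system):

$ printf '\353\n' | iconv -f CP437 -t UTF-8
δ
$ printf '\353\n' | iconv -f CP850 -t UTF-8
Ù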

od -x’s output is confusing in this case because of endianness; a better option is -t x1 which uses single bytes:

$ printf "\353\n" | od -t x1
0000000 eb 0a
0000002

od -x is equivalent to od -t x2, which reads two bytes at a time; on little-endian systems, the bytes of each pair come out in reverse order.
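
If your od comes from GNU coreutils (8.23 or later), you can force the byte order with --endian and watch the pair flip:

$ printf '\353\n' | od --endian=little -t x2
0000000 0aeb
0000002
$ printf '\353\n' | od --endian=big -t x2
0000000 eb0a
0000002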

When you come across a file like this, which isn’t valid UTF-8 (or which makes no sense when interpreted as UTF-8), there’s no fool-proof way to automatically determine its encoding (and character set). Context can help: if it’s a file produced on a Western PC in the last couple of decades, there’s a fair chance it’s encoded in ISO-8859-1, -15 (the Euro variant), or Windows-1252; if it’s older than that, CP-437 and CP-850 are likely candidates. Files from Eastern European, Russian, or Asian systems would use different character sets that I don’t know much about. Then there’s EBCDIC...

iconv -l will list all the character sets that iconv knows about, and you can proceed by trial and error from there.
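
A rough sketch of that trial and error, looping over a few plausible candidates (the list is just an example, spelled the way glibc's iconv expects; adjust it from iconv -l):

$ for cs in ISO-8859-1 ISO-8859-15 WINDOWS-1252 CP437 CP850; do
>   printf '%-13s: ' "$cs"; iconv -f "$cs" -t UTF-8 file
> done
ISO-8859-1   : ë
ISO-8859-15  : ë
WINDOWS-1252 : ë
CP437        : δ
CP850        : Ù

Any conversion that fails with "illegal input sequence" rules that character set out; the rest you have to judge by eye.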

(At one point I knew most of CP-437 and ATASCII off by heart, them were the days.)
