How to identify a strange character

Tags: character-encoding, unicode

I am trying to identify a strange character I have found in a file I am working with:

$ cat file
�
$ od file
0000000 005353
0000002
$ od -c file
0000000 353  \n
0000002
$ od -x file
0000000 0aeb
0000002

The file is using ISO-8859 encoding and can't be converted to UTF-8:

$ iconv -f ISO-8859 -t UTF-8 file
iconv: conversion from `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
$ iconv  -t UTF-8 file
iconv: illegal input sequence at position 0
$ file file
file: ISO-8859 text

My main question is: how can I interpret the output of od here? I am trying to use this page, which lets me translate between different character representations, but it tells me that 005353 as a "Hex code point" is 卓, which doesn't seem right, and that 0aeb as a "Hex code point" is ૫, which, again, seems wrong.

So, how can I use any of the three values (353, 005353 or 0aeb) to find out what character they are supposed to represent?

And yes, I did try with Unicode tools, but it doesn't seem to be a valid UTF-8 character either:

$ uniprops $(cat file)
U+FFFD ‹�› \N{REPLACEMENT CHARACTER}
    \pS \p{So}
    All Any Assigned Common Zyyy So S Gr_Base Grapheme_Base Graph X_POSIX_Graph
       GrBase Other_Symbol Print X_POSIX_Print Symbol Specials Unicode

If I understand the description of the Unicode U+FFFD character correctly, it isn't a real character at all but a placeholder for a corrupted character. That makes sense, since the file isn't actually UTF-8 encoded.

Best Answer

Your file contains two bytes, EB and 0A in hex. It’s likely that the file is using a character set with one byte per character, such as ISO-8859-1; in that character set, EB is ë:

$ printf "\353\n" | iconv -f ISO-8859-1
ë
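
The \353 in that printf is the same octal value that od -c printed: octal 353 is hex EB (decimal 235). printf can convert between the two notations, since it treats a numeric argument with a leading 0 as octal:

$ printf '%02x\n' 0353 012
eb
0a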

Other candidates would be δ in code page 437, Ù in code page 850...
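
You can check those with iconv too, decoding the same byte under each assumption (the encoding names below are the ones GNU iconv uses; check iconv -l on your system):

$ printf '\353\n' | iconv -f CP437 -t UTF-8
δ
$ printf '\353\n' | iconv -f CP850 -t UTF-8
Ù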

od -x’s output is confusing in this case because of endianness; a better option is -t x1 which uses single bytes:

$ printf "\353\n" | od -t x1
0000000 eb 0a
0000002

od -x is equivalent to od -t x2, which reads two bytes at a time; on little-endian systems, the bytes of each pair come out in reverse order.
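
If your od comes from GNU coreutils (8.23 or later), you can force the byte order with --endian and watch the pair flip:

$ printf '\353\n' | od --endian=little -t x2
0000000 0aeb
0000002
$ printf '\353\n' | od --endian=big -t x2
0000000 eb0a
0000002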

When you come across a file like this, which isn’t valid UTF-8 (or which makes no sense when interpreted as UTF-8), there’s no fool-proof way to automatically determine its encoding (and character set). Context can help: if it’s a file produced on a Western PC in the last couple of decades, there’s a fair chance it’s encoded in ISO-8859-1, -15 (the Euro variant), or Windows-1252; if it’s older than that, CP-437 and CP-850 are likely candidates. Files from Eastern European, Russian, or Asian systems would use different character sets that I don’t know much about. Then there’s EBCDIC...

iconv -l will list all the character sets that iconv knows about, and you can proceed by trial and error from there.
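
A rough sketch of that trial and error, looping over a few plausible candidates (the list is just an example, spelled the way glibc's iconv expects; adjust it from iconv -l):

$ for cs in ISO-8859-1 ISO-8859-15 WINDOWS-1252 CP437 CP850; do
>   printf '%-13s: ' "$cs"; iconv -f "$cs" -t UTF-8 file
> done
ISO-8859-1   : ë
ISO-8859-15  : ë
WINDOWS-1252 : ë
CP437        : δ
CP850        : Ù

Any conversion that fails with "illegal input sequence" rules that character set out; the rest you have to judge by eye.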

(At one point I knew most of CP-437 and ATASCII off by heart, them were the days.)
