I am trying to identify a strange character I have found in a file I am working with:
$ cat file
�
$ od file
0000000 005353
0000002
$ od -c file
0000000 353 \n
0000002
$ od -x file
0000000 0aeb
0000002
The file is using ISO-8859 encoding and can't be converted to UTF-8:
$ iconv -f ISO-8859 -t UTF-8 file
iconv: conversion from `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
$ iconv -t UTF-8 file
iconv: illegal input sequence at position 0
$ file file
file: ISO-8859 text
My main question is: how can I interpret the output of od here? I am trying to use this page, which lets me translate between different character representations, but it tells me that 005353 as a "Hex code point" is 卓, which doesn't seem right, and 0aeb as a "Hex code point" is ૫, which, again, seems wrong.
So, how can I use any of the three options (353, 005353 or 0aeb) to find out what character they are supposed to represent?
And yes, I did try with Unicode tools, but it doesn't seem to be valid UTF-8 either:
$ uniprops $(cat file)
U+FFFD ‹�› \N{REPLACEMENT CHARACTER}
\pS \p{So}
All Any Assigned Common Zyyy So S Gr_Base Grapheme_Base Graph X_POSIX_Graph
GrBase Other_Symbol Print X_POSIX_Print Symbol Specials Unicode
If I understand the description of the Unicode U+FFFD character correctly, it isn't a real character at all but a placeholder for a corrupted character. That makes sense, since the file isn't actually UTF-8 encoded.
Best Answer
Your file contains two bytes, EB and 0A in hex. It's likely that the file is using a character set with one byte per character, such as ISO-8859-1; in that character set, EB is ë (U+00EB, LATIN SMALL LETTER E WITH DIAERESIS).
Other candidates would be δ in code page 437, Ù in code page 850...
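You can check these candidates yourself by feeding the single byte to iconv; this assumes the charset names ISO-8859-1, CP437 and CP850, which iconv -l should list on most systems:

```shell
# Decode the byte 0xEB (octal 353) under a few single-byte character sets
printf '\353' | iconv -f ISO-8859-1 -t UTF-8   # ë
printf '\353' | iconv -f CP437 -t UTF-8        # δ
printf '\353' | iconv -f CP850 -t UTF-8        # Ù
```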
od -x's output is confusing in this case because of endianness; a better option is od -t x1, which outputs single bytes. od -x maps to od -t x2, which reads two bytes at a time and, on little-endian systems, outputs them in reverse order.

When you come across a file like this, which isn't valid UTF-8 (or makes no sense when interpreted as a UTF-8 file), there's no fool-proof way to automatically determine its encoding (and character set). Context can help: if it's a file produced on a Western PC in the last couple of decades, there's a fair chance it's encoded in ISO-8859-1, ISO-8859-15 (the Euro variant), or Windows-1252; if it's older than that, CP-437 and CP-850 are likely candidates. Files from Eastern European systems, or Russian systems, or Asian systems, would use different character sets that I don't know much about. Then there's EBCDIC...
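A quick way to see the endianness difference, recreating the two-byte file with printf (an assumption about how the original was produced):

```shell
# Recreate a file containing the same two bytes: 0xEB, then a newline
printf '\353\n' > file

# -A n suppresses the offset column
od -An -t x1 file   # one byte at a time: eb 0a, regardless of endianness
od -An -t x2 file   # two-byte words: 0aeb on a little-endian machine
```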
iconv -l will list all the character sets that iconv knows about, and you can proceed by trial and error from there.

(At one point I knew most of CP-437 and ATASCII off by heart, them were the days.)
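A rough sketch of that trial-and-error loop; the exact output format of iconv -l varies between implementations (glibc appends //, GNU libiconv separates names with spaces and commas), so treat this as a starting point rather than a portable script:

```shell
# Recreate the two-byte file, then try every character set iconv knows,
# printing each one that decodes the file without error
printf '\353\n' > file
for cs in $(iconv -l | tr -d ','); do
    if out=$(iconv -f "$cs" -t UTF-8 file 2>/dev/null); then
        printf '%s: %s' "$cs" "$out"
    fi
done
```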