How to test the encoding of a text file: Is it valid, and what is it?

character encoding, text processing, utilities

I have several .htm files which open in Gedit without any warning or error, but when I open these same files in jEdit, it warns me of invalid UTF-8 encoding…

The HTML meta tag states "charset=ISO-8859-1". jEdit allows a list of fallback encodings and a list of encoding auto-detectors (currently "BOM XML-PI"), so my immediate problem has been resolved. But it got me thinking: what if the metadata weren't there?

When the encoding information is simply not available, is there a CLI program that can make a "best guess" about which encodings may apply?

And, although it is a slightly different issue: is there a CLI program that tests whether a file is valid in a known encoding?

Best Answer

The file command makes "best guesses" about the encoding. Use the -i option (on GNU file, a synonym for --mime) to make file print MIME type and encoding information.

Demonstration:

$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt:    text/plain; charset=utf-16le
umlaut-utf8.txt:     text/plain; charset=utf-8
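
If you only want the charset without the MIME type, GNU file also accepts --mime-encoding (option availability varies between file versions and platforms):

$ file --mime-encoding *
umlaut-iso88591.txt: iso-8859-1
umlaut-utf16.txt:    utf-16le
umlaut-utf8.txt:     utf-8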

Here is how I created the files:

$ echo ä > umlaut-utf8.txt 
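
The bytes that echo writes depend on the terminal's locale; locale charmap shows which encoding is in effect (output shown for a typical UTF-8 locale):

$ locale charmap
UTF-8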

Nowadays UTF-8 is the default on most systems. But convince yourself:

$ hexdump -C umlaut-utf8.txt 
00000000  c3 a4 0a                                          |...|
00000003

Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding

Convert to the other encodings:

$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt 
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt 
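
As a sanity check, converting back should reproduce the original bytes exactly; a quick round trip confirms it (assuming GNU iconv and cmp, which is silent when its inputs match):

$ iconv -f iso88591 -t utf8 umlaut-iso88591.txt | cmp - umlaut-utf8.txt && echo round-trip OK
round-trip OK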

Check the hex dumps:

$ hexdump -C umlaut-iso88591.txt 
00000000  e4 0a                                             |..|
00000002
$ hexdump -C umlaut-utf16.txt 
00000000  ff fe e4 00 0a 00                                 |......|
00000006
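
The ff fe at the start of the UTF-16 file is a byte order mark (BOM). The generic utf16 target prepends it, and it is what lets tools detect the little-endian variant; requesting an explicit byte order omits it (utf16le is an alias accepted by GNU iconv):

$ iconv -f utf8 -t utf16le umlaut-utf8.txt | hexdump -C
00000000  e4 00 0a 00                                       |....|
00000004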

Create something "invalid" by mixing all three:

$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt 
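
A hex dump of the mix shows all three encodings of ä back to back; the bare e4 at the start is not a valid UTF-8 sequence, which is what ruins the file as UTF-8:

$ hexdump -C umlaut-mixed.txt
00000000  e4 0a c3 a4 0a ff fe e4  00 0a 00                 |...........|
0000000b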

What file says:

$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-mixed.txt:    application/octet-stream; charset=binary
umlaut-utf16.txt:    text/plain; charset=utf-16le
umlaut-utf8.txt:     text/plain; charset=utf-8

Without -i:

$ file *
umlaut-iso88591.txt: ISO-8859 text
umlaut-mixed.txt:    data
umlaut-utf16.txt:    Little-endian UTF-16 Unicode text, with no line terminators
umlaut-utf8.txt:     UTF-8 Unicode text
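
file is not the only guesser. If it is installed, uchardet (a standalone version of Mozilla's character-detection library) prints a single best guess per file, though samples this tiny can easily be misdetected (typical output shown):

$ uchardet umlaut-utf8.txt
UTF-8

enca is another such tool, aimed mainly at Central and Eastern European languages.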

The file command has no concept of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. As humans, we might recognize that a file is a text file containing umlauts in a "wrong" encoding; for a computer to do the same, it would need some sort of artificial intelligence.

One might argue that the heuristics of file are a kind of artificial intelligence. Even if they are, it is a very limited kind.
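
As for the second part of the question, testing the validity of a known encoding: iconv can do what file cannot. Converting a file from an encoding to itself succeeds only if every byte sequence is valid, and iconv exits non-zero at the first bad sequence (message wording varies by implementation; GNU iconv shown):

$ iconv -f utf8 -t utf8 umlaut-utf8.txt > /dev/null && echo valid
valid
$ iconv -f utf8 -t utf8 umlaut-mixed.txt > /dev/null
iconv: illegal input sequence at position 0

For UTF-8 specifically, the isutf8 program from the moreutils package performs the same check and reports the line and byte offset of the first offending sequence.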

Here is more information about the file command: http://www.linfo.org/file_command.html
