How to test the encoding of a text file: Is it valid, and what is it?

character encoding, text processing, utilities

I have several .htm files which open in Gedit without any warning or error, but when I open these same files in jEdit, it warns me of invalid UTF-8 encoding…

The HTML meta tag states "charset=ISO-8859-1". jEdit allows a list of fallback encodings and a list of encoding auto-detectors (currently "BOM XML-PI"), so my immediate problem has been resolved. But it got me thinking: what if the metadata weren't there?

When the encoding information is simply not available, is there a CLI program that can make a "best guess" about which encodings may apply?

And, although it is a slightly different issue: is there a CLI program that tests whether a file is valid in a known encoding?

Best Answer

The file command makes "best guesses" about the encoding. Use the -i option (on GNU file, a synonym for --mime) to make file print MIME type and encoding information.

Demonstration:

$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt:    text/plain; charset=utf-16le
umlaut-utf8.txt:     text/plain; charset=utf-8
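
If you only want the charset without the MIME type, GNU file also accepts --mime-encoding (option availability varies between file versions and platforms):

$ file --mime-encoding *
umlaut-iso88591.txt: iso-8859-1
umlaut-utf16.txt:    utf-16le
umlaut-utf8.txt:     utf-8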

Here is how I created the files:

$ echo ä > umlaut-utf8.txt 
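
The bytes that echo writes depend on the terminal's locale; locale charmap shows which encoding is in effect (output shown for a typical UTF-8 locale):

$ locale charmap
UTF-8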

Nowadays UTF-8 is the default on most systems. But convince yourself:

$ hexdump -C umlaut-utf8.txt 
00000000  c3 a4 0a                                          |...|
00000003

Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding

Convert to the other encodings:

$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt 
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt 
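
As a sanity check, converting back should reproduce the original bytes exactly; a quick round trip confirms it (assuming GNU iconv and cmp, which is silent when its inputs match):

$ iconv -f iso88591 -t utf8 umlaut-iso88591.txt | cmp - umlaut-utf8.txt && echo round-trip OK
round-trip OK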

Check the hex dumps:

$ hexdump -C umlaut-iso88591.txt 
00000000  e4 0a                                             |..|
00000002
$ hexdump -C umlaut-utf16.txt 
00000000  ff fe e4 00 0a 00                                 |......|
00000006
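
The ff fe at the start of the UTF-16 file is a byte order mark (BOM). The generic utf16 target prepends it, and it is what lets tools detect the little-endian variant; requesting an explicit byte order omits it (utf16le is an alias accepted by GNU iconv):

$ iconv -f utf8 -t utf16le umlaut-utf8.txt | hexdump -C
00000000  e4 00 0a 00                                       |....|
00000004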

Create something "invalid" by mixing all three:

$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt 
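
A hex dump of the mix shows all three encodings of ä back to back; the bare e4 at the start is not a valid UTF-8 sequence, which is what ruins the file as UTF-8:

$ hexdump -C umlaut-mixed.txt
00000000  e4 0a c3 a4 0a ff fe e4  00 0a 00                 |...........|
0000000b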

What file says:

$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-mixed.txt:    application/octet-stream; charset=binary
umlaut-utf16.txt:    text/plain; charset=utf-16le
umlaut-utf8.txt:     text/plain; charset=utf-8

Without -i:

$ file *
umlaut-iso88591.txt: ISO-8859 text
umlaut-mixed.txt:    data
umlaut-utf16.txt:    Little-endian UTF-16 Unicode text, with no line terminators
umlaut-utf8.txt:     UTF-8 Unicode text
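
file is not the only guesser. If it is installed, uchardet (a standalone version of Mozilla's character-detection library) prints a single best guess per file, though samples this tiny can easily be misdetected (typical output shown):

$ uchardet umlaut-utf8.txt
UTF-8

enca is another such tool, aimed mainly at Central and Eastern European languages.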

The file command has no concept of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. As humans, we might recognize that a file is a text file containing umlauts in a "wrong" encoding; for a computer to do the same, it would need some sort of artificial intelligence.

One might argue that the heuristics of file are a kind of artificial intelligence. Even if they are, it is a very limited kind.
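
As for the second part of the question, testing the validity of a known encoding: iconv can do what file cannot. Converting a file from an encoding to itself succeeds only if every byte sequence is valid, and iconv exits non-zero at the first bad sequence (message wording varies by implementation; GNU iconv shown):

$ iconv -f utf8 -t utf8 umlaut-utf8.txt > /dev/null && echo valid
valid
$ iconv -f utf8 -t utf8 umlaut-mixed.txt > /dev/null
iconv: illegal input sequence at position 0

For UTF-8 specifically, the isutf8 program from the moreutils package performs the same check and reports the line and byte offset of the first offending sequence.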

Here is more information about the file command: http://www.linfo.org/file_command.html
