The file command makes "best guesses" about the encoding. Use the -i option to make file print MIME-type information, which includes the charset.
Demonstration:
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8
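If you only care about the charset, newer versions of file also support --mime-encoding, which prints just that part (availability depends on your version of file):
$ file --mime-encoding *
umlaut-iso88591.txt: iso-8859-1
umlaut-utf16.txt: utf-16le
umlaut-utf8.txt: utf-8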
Here is how I created the files:
$ echo ä > umlaut-utf8.txt
Nowadays almost everything is UTF-8, but convince yourself:
$ hexdump -C umlaut-utf8.txt
00000000 c3 a4 0a |...|
00000003
Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding
Convert to the other encodings:
$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt
Check the hex dump:
$ hexdump -C umlaut-iso88591.txt
00000000 e4 0a |..|
00000002
$ hexdump -C umlaut-utf16.txt
00000000 ff fe e4 00 0a 00 |......|
00000006
Create something "invalid" by mixing all three:
$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt
What file says:
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-mixed.txt: application/octet-stream; charset=binary
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8
Without -i:
$ file *
umlaut-iso88591.txt: ISO-8859 text
umlaut-mixed.txt: data
umlaut-utf16.txt: Little-endian UTF-16 Unicode text, with no line terminators
umlaut-utf8.txt: UTF-8 Unicode text
The file command has no idea of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. As humans, we might be able to recognize that a file is a text file with some umlauts in a "wrong" encoding, but a computer would need some sort of artificial intelligence to do that.
One might argue that the heuristics of file are some sort of artificial intelligence. Yet even if they are, it is a very limited one.
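By the way, if what you want is a validity check rather than a guess, iconv can serve as one: it fails on the first byte sequence that is invalid in the stated source encoding. For example, with the files created above (the exact error wording and exit behaviour depend on your iconv implementation):
$ iconv -f utf-8 -t utf-8 umlaut-utf8.txt > /dev/null && echo valid
valid
$ iconv -f utf-8 -t utf-8 umlaut-mixed.txt > /dev/null && echo valid
iconv: illegal input sequence at position 0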
Here is more information about the file command: http://www.linfo.org/file_command.html
What you have is in fact ASCII (in its usual encoding in 8-bit bytes) with a bit of UCS-2 (Unicode restricted to the Basic Multilingual Plane (BMP), where each character is encoded as two 8-bit bytes), or perhaps UTF-16 (an extension of UCS-2 that can encode all of Unicode by using a multi-word encoding for code points above U+D7FF).
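To see that multi-word encoding in action, take a code point outside the BMP, e.g. U+1F600 (the 😀 emoji). A quick check with the tools used above, assuming a UTF-8 locale; the surrogate pair D83D DE00 shows up as four little-endian bytes:
$ printf '😀' | iconv -f utf-8 -t utf-16le | hexdump -C
00000000 3d d8 00 de |=...|
00000004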
I doubt you'll find a tool that can handle such an unholy mixture out of the box. There is no way to decode the file in full generality. In your case, what probably happened is that some ASCII data was encoded into UTF-16 at some point (Windows and Java are fond of UTF-16; it's practically unheard of in the unix world). If you go by the assumption that the original data was all ASCII, you can recover a usable file by stripping off all null bytes.
$ tr -d '\000' <bizarre >ascii
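To see why that works: UTF-16LE encodes every ASCII character as its ASCII byte followed by a null byte, so deleting the nulls restores plain ASCII. A quick sketch with a throwaway string (output shown in the same compressed hexdump style as above):
$ printf 'Hi' | iconv -f ascii -t utf-16le | hexdump -C
00000000 48 00 69 00 |H.i.|
00000004
$ printf 'Hi' | iconv -f ascii -t utf-16le | tr -d '\000' | hexdump -C
00000000 48 69 |Hi|
00000002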
You can use iconv or recode to convert the file. But you will need to specify the source encoding.
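For example, to re-encode the Latin-1 file created earlier to UTF-8 (recode uses a from..to syntax and, unlike iconv, rewrites the file in place; it may not be installed by default):
$ iconv -f iso-8859-1 -t utf-8 umlaut-iso88591.txt > umlaut-recovered.txt
$ recode iso-8859-1..utf-8 umlaut-iso88591.txt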
The information about the source encoding has to come from somewhere. A plain text file doesn't contain any information about its encoding. Some types of formatted text contain an indication (for example headers in HTML or in LaTeX), but in general, you're on your own. It's up to the environment to know what encoding it uses for its text file.
You can try to guess the source encoding. This only has a chance of working if you have some information about the file: either you know what language it's in (e.g. you know it's in Polish or English), or there's only a small number of potential encodings (e.g. it's either UTF-8 or Latin-1). See "How can I test the encoding of a text file... Is it valid, and what is it?" and "How do I re-encode a mixed encoded text file" for some possibilities, including Enca and Perl's Encode::Guess. You'll need to work out, based on your data set, whether one of these tools can work for you.
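As a rough sketch of what such guessing looks like with Enca (assuming it is installed; the language hint after -L is needed to narrow the guess, and notes-pl.txt is a hypothetical file known to contain Polish text):
$ enca -L polish notes-pl.txt
$ enca -L polish -x utf-8 notes-pl.txt
The first command prints Enca's guess of the encoding; the second asks Enca to convert the file to UTF-8 in place, if your build supports conversion.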