Change Text File Encoding without knowing the source encoding

character encoding

I want to change the character encoding of a file on Unix with a single command, but since this will be an automated process, I have no way of knowing the source encoding.

So I want a command that will change the encoding to UTF-8 for any source encoding.

Best Answer

You can use iconv or recode to convert the file. But you will need to specify the source encoding.
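For the case where the source encoding is known, a minimal sketch with iconv might look like the following. It assumes the input is Latin-1 (ISO-8859-1); the file names `input.txt` and `output.txt` are placeholders chosen for illustration.

```shell
# Create a sample file containing "café" encoded as Latin-1
# (é is the single byte 0xE9, written here as octal 351).
printf 'caf\351\n' > input.txt

# Convert from the stated source encoding to UTF-8.
# -f names the source encoding, -t names the target encoding.
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
```

After the conversion, the é in `output.txt` is stored as the two-byte UTF-8 sequence 0xC3 0xA9.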

The information about the source encoding has to come from somewhere. A plain text file doesn't contain any information about its encoding. Some types of formatted text contain an indication (for example headers in HTML or in LaTeX), but in general, you're on your own. It's up to the environment to know what encoding it uses for its text files.

You can try to guess the source encoding. This only has a chance of working if you have some information about the file: either you know what language it's in (e.g. you know it's in Polish or English), or there's only a small number of potential encodings (e.g. it's either UTF-8 or Latin-1). See "How can I test the encoding of a text file... Is it valid, and what is it?" and "How do I re-encode a mixed encoded text file" for some possibilities, including Enca and Perl's Encode::Guess. You'll need to work out, based on your data set, whether one of these tools can work for you.
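When the candidates really are limited to UTF-8 and Latin-1, one possible sketch of the guessing approach is shown below. The function name `to_utf8` is hypothetical, and the whole thing rests on the assumption that no other encoding can occur: a file in, say, ISO-8859-2 would be converted without error but with some wrong characters.

```shell
# to_utf8 FILE: write FILE's contents to stdout as UTF-8.
# Assumption: FILE is either valid UTF-8 or Latin-1; nothing else is considered.
to_utf8() {
    # iconv exits non-zero when the input is not valid in the stated
    # source encoding, so a UTF-8 to UTF-8 pass doubles as a validity check.
    if iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        # Already valid UTF-8 (plain ASCII also lands here): pass it through.
        cat "$1"
    else
        # Every byte value is defined in Latin-1, so this fallback does not
        # reject anything; it is only correct if the file really is Latin-1.
        iconv -f ISO-8859-1 -t UTF-8 "$1"
    fi
}
```

It could be called as `to_utf8 somefile.txt > somefile.utf8.txt`. Testing for UTF-8 first matters because Latin-1 accepts any byte sequence, so the order of the checks cannot be reversed.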
