Ubuntu – Convert Text File Encoding

encoding

I frequently encounter text files (such as subtitle files in my native language, Persian) with character encoding problems. These files are created on Windows, and saved with an unsuitable encoding (seems to be ANSI), which looks gibberish and unreadable, like this:

enter image description here

In Windows, one can fix this easily using Notepad++ to convert the encoding to UTF-8, like below:

enter image description here

And the correct readable result is like this:

enter image description here

I've searched a lot for a similar solution on GNU/Linux, but unfortunately the suggested solutions (e.g this question) don't work. Most of all, I've seen people suggest iconv and recode but I have had no luck with these tools. I've tested many commands, including the followings, and all have failed:

$ recode ISO-8859-15..UTF8 file.txt
$ iconv -f ISO8859-15 -t UTF-8 file.txt > out.txt
$ iconv -f WINDOWS-1252 -t UTF-8 file.txt > out.txt 

None of these worked!

I'm using Ubuntu-14.04 and I'm looking for a simple solution (either GUI or CLI) that works just as Notepad++ does.

One important aspect of being "simple" is that the user is not required to determine the source encoding; rather the source encoding should be automatically detected by the tool and only the target encoding should be provided by the user. But nevertheless, I will also be glad to know about a working solution that requires the source encoding to be provided.

If someone needs a test-case to examine different solutions, the above example is accessible via this link.

Best Answer

These Windows files with Persian text are encoded in Windows-1256. So it can be deciphered by command similar to OP tried, but with different charsets. Namely:

recode Windows-1256..UTF-8 <Windows_file.txt > UTF8_file.txt
(denounced upon original poster’s complaints; see comments)

iconv -f Windows-1256 Windows_file.txt > UTF8_file.txt

This one assumes that the LANG environment variable is set to a UTF-8 locale. To convert to any encoding (UTF-8 or otherwise), regardless of the current locale, one can say:

iconv -f Windows-1256 Windows_file.txt -t ${output_encoding} > ${output_file}

Original poster is also confused with semantic of text recoding tools (recode, iconv). For source encoding (source.. or -f) one must specify encoding with which the file is saved (by the program that created it). Not some (naïve) guesses based on mojibake characters in programs that try (but fail) to read it. Trying either ISO-8859-15 or WINDOWS-1252 for a Persian text was obviously an impasse: these encodings merely do not contain any Persian letter.

Related Question