Character Encoding – Convert Non-ISO Extended-ASCII to UTF-8


I have a text file:

$ file -i x.txt
x.txt: text/plain; charset=unknown-8bit
$ file x.txt 
x.txt: Non-ISO extended-ASCII text, with CRLF line terminators

Some of its characters are incorrectly encoded:

trwa³y, sta³y, usuwaæ

How can I change this file's encoding to UTF-8? So far I have tried the following:

$ iconv -f ASCII -t UTF-8 x.txt
                puiconv: illegal input sequence at position 4

Maybe I should somehow use extended ASCII (high ASCII), but I cannot find it in iconv's encoding list.

Best Answer

file tells you “Non-ISO extended-ASCII text” because it detects that this is:

most likely a “text” file from the lack of control characters (byte values 0–31) other than line breaks;
“extended-ASCII” because there are characters outside the ASCII range (byte values ≥128);
“non-ISO” because there are characters in the 128–159 range (ISO 8859 reserves this range for control characters).
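The last two checks can be approximated by hand. The following is only a rough sketch, not how file is actually implemented; the sample bytes and the file name sample.txt are made up for illustration (0xB3 is ł and 0x9C is ś in CP1250).

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Hypothetical sample: "trwały świat" as CP1250 bytes with a CRLF line ending.
# \263 is byte 0xB3 and \234 is byte 0x9C (octal escapes).
printf 'trwa\263y \234wiat\r\n' > sample.txt

# "extended-ASCII": count bytes >= 128 by deleting everything in 0x00-0x7F.
high=$(LC_ALL=C tr -d '\000-\177' < sample.txt | wc -c | tr -d ' ')

# "non-ISO": count bytes in 128-159 by also deleting 0xA0-0xFF.
c1=$(LC_ALL=C tr -d '\000-\177\240-\377' < sample.txt | wc -c | tr -d ' ')

echo "bytes >= 128:     $high"
echo "bytes in 128-159: $c1"
```

A non-zero second count means the file cannot be plain ISO 8859 text, which is what points towards a Windows code page such as CP1250.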

You have to figure out which encoding this file seems to be in. You can try Enca's automatic recognition. You might need to nudge it in the right direction by telling it in what language the text is.

enca x.txt
enca -L polish x.txt

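To convert the file, pass the -x option. Enca converts files named on the command line in place, so feed the file on standard input to write the result to a new file: enca -L polish -x utf8 <x.txt >x.utf8.txt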

If you can't or don't want to use Enca, you can guess the encoding manually. A bit of looking around told me that this is Polish text and the words are trwały, stały, usuwać, so we're looking for a translation where ³ → ł and æ → ć. This looks like latin-2 or latin-10 or, more likely (given “non-ISO”), CP1250, which you're viewing as latin1. To convert the file to UTF-8, you can use recode or iconv.

recode CP1250..utf8 <x.txt >x.utf8.txt
iconv -f CP1250 -t UTF-8 <x.txt >x.utf8.txt
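The manual guess can be checked by decoding the same bytes under each candidate encoding and seeing which output reads as Polish. This is a small sketch; it rebuilds the three words from the question as raw bytes (an assumption about the file's content, with 0xB3 and 0xE6 as the two problem bytes) rather than using the real x.txt.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# The question's three words as raw bytes: \263 = 0xB3, \346 = 0xE6.
printf 'trwa\263y, sta\263y, usuwa\346\r\n' > x.txt

# Decode the same bytes under each candidate and compare the results by eye.
for enc in ISO-8859-1 ISO-8859-2 CP1250; do
    printf '%-10s ' "$enc"
    iconv -f "$enc" -t UTF-8 < x.txt | tr -d '\r'
done

# Convert for real. Removing the CR bytes (CRLF -> LF) is an optional extra.
iconv -f CP1250 -t UTF-8 < x.txt | tr -d '\r' > x.utf8.txt
```

For these two bytes ISO-8859-2 and CP1250 give the same Polish letters, while ISO-8859-1 reproduces the garbled text from the question. Only bytes in the 128–159 range, which file reported, separate CP1250 from latin-2.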

Related Solutions

Text – View File Containing DOS Text and Escape Sequences

That's the MS-DOS character set (code page 437).

Try recode cp437..u8 in a UTF-8 terminal.

It gives:

██▀▀▀▀▀▀ ██▀▀▀▀▀█  █▀▀▀▀▀█ ██▀▀█▀▀█ ██       █▀▀▀▀▀█ ██▀▀█ ██ ██▀▀▀▀▀▄
██▄▄▄▄▄▄ ██▄▄▄▄▄█  █▄▄▄▄▄█ ██ ██ ██ ██       █▄▄▄▄▄█ ██ ██ ██ ██    ██
      ▄█ ██        █    ▄█ ██    ██ ██       █    ▄█ ██ ██ ██ ██    ██
▄▄▄▄▄▄▄█ ██        █     █ ██    ██ ██▄▄▄▄▄  █     █ ██ ██▄▄█ ██▄▄▄▄▄▀

in colour.
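If recode is not installed, iconv can do the same conversion. As a minimal sketch, the three bytes below are hypothetical sample input (0xDB twice and 0xDF, which are block-drawing characters in CP437), not taken from the original file.

```shell
# Three hypothetical CP437 bytes: 0xDB 0xDB 0xDF
# (two full blocks followed by an upper half block).
blocks=$(printf '\333\333\337' | iconv -f CP437 -t UTF-8)

# In a UTF-8 terminal this prints the block characters.
echo "$blocks"
```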

Command Line – Character Encodings Supported by more, cat, and less

Your shell can display accents and similar characters because it is probably using UTF-8. Since the file in question is in a different encoding, less, more and cat try to read it as UTF-8 and fail. You can check your current encoding with

echo $LANG

You have two choices: change your default encoding, or convert the file to UTF-8. To change your encoding, open a terminal and type

export LANG="fr_FR.ISO-8859-1"

For example:

$ echo $LANG 
en_US.UTF-8
$ cat foo.txt 
J'ai mal � la t�te, c'est chiant!
$ export LANG="fr_FR.ISO-8859-1"
$ xterm    # open a new terminal
$ cat foo.txt 
J'ai mal à la tête, c'est chiant!

If you are using gnome-terminal or similar, you may also need to activate the encoding in the terminal emulator itself. In terminator, for example, this is done from the right-click menu:

[screenshot: terminator]

For gnome-terminal:

[screenshot: gnome-terminal]

Your other (better) option is to change the file's encoding:

$ cat foo.txt 
J'ai mal � la t�te, c'est chiant!
$ iconv -f ISO-8859-1 -t UTF-8  foo.txt > bar.txt
$ cat bar.txt 
J'ai mal à la tête, c'est chiant!
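One way to confirm that the conversion worked, without relying on how the terminal renders the text, is to ask iconv to decode the file as UTF-8 and look at its exit status. This is a sketch under stated assumptions: is_utf8 is a hypothetical helper name, and foo.txt is rebuilt here from raw ISO-8859-1 bytes instead of using the real file.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# "J'ai mal à la tête" in ISO-8859-1: \340 = 0xE0 (à), \352 = 0xEA (ê).
printf '%s \340 la t\352te\n' "J'ai mal" > foo.txt

# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_utf8() { iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; }

if is_utf8 foo.txt; then before=valid; else before=invalid; fi

# Same conversion as above.
iconv -f ISO-8859-1 -t UTF-8 < foo.txt > bar.txt

if is_utf8 bar.txt; then after=valid; else after=invalid; fi

echo "foo.txt: $before UTF-8"
echo "bar.txt: $after UTF-8"
```

The check only proves that the result is well-formed UTF-8, not that ISO-8859-1 was the right source encoding. A wrong guess can still produce valid UTF-8 with the wrong letters.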
