I have a text file:
$ file -i x.txt
x.txt: text/plain; charset=unknown-8bit
$ file x.txt
x.txt: Non-ISO extended-ASCII text, with CRLF line terminators
Some characters in it are displayed incorrectly:
trwa³y, sta³y, usuwaæ
How can I change this file's encoding to UTF-8? So far I have tried the following:
$ iconv -f ASCII -t UTF-8 x.txt
puiconv: illegal input sequence at position 4
Maybe I should somehow use extended ASCII (high ASCII), but I cannot find it in iconv's encoding list.
Best Answer
file tells you “Non-ISO extended-ASCII text” because it detects that this is 8-bit text that is neither plain ASCII nor an ISO 8859 encoding.
You have to figure out which encoding this file seems to be in. You can try Enca's automatic recognition. You might need to nudge it in the right direction by telling it in what language the text is.
To convert the file, pass the -x option:
enca -L polish -x utf8 <x.txt >x.utf8.txt
If you can't or don't want to use Enca, you can guess the encoding manually. A bit of looking around told me that this is Polish text and the words are trwały, stały, usuwać, so we're looking for a translation where ³ → ł and æ → ć. This looks like latin-2 or latin-10 or, more likely (given “non-ISO”), CP1250, which you're viewing as latin1. To convert the file to UTF-8, you can use recode or iconv.
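As a rough sketch of the manual route with iconv: the snippet below rebuilds the three sample words as raw bytes, tries each candidate encoding, and then converts with CP1250. The file names sample.txt and sample.utf8.txt are made up for the illustration, and the byte values assume the text really is CP1250 (where 0xB3 is ł and 0xE6 is ć); substitute x.txt once you are happy with the result.

```shell
# Recreate the sample words as raw bytes: \263 is 0xB3, \346 is 0xE6.
# The CRLF ending mimics the file from the question. (Hypothetical file name.)
printf 'trwa\263y, sta\263y, usuwa\346\r\n' > sample.txt

# Decode the same bytes with each candidate encoding and compare by eye.
for enc in ISO-8859-2 ISO-8859-16 CP1250; do
    printf '%s: ' "$enc"
    iconv -f "$enc" -t UTF-8 sample.txt
done

# Convert using the encoding that gave readable text.
iconv -f CP1250 -t UTF-8 sample.txt > sample.utf8.txt
```

For these three words all the candidates give the same output, so they can't settle the question on their own; the candidates differ on other bytes, so it is worth checking a few more words from the real file before converting. iconv only changes the character encoding, so the CRLF line endings are left as they are.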