How to convert an unknown-8bit file to UTF-8

character encoding, unicode

I have a .srt file that displays as gibberish when I open it in gedit on Ubuntu.
So I want to convert it to UTF-8 to be able to read it.

When I try to figure out what the encoding is, I get:

file -i x.srt 
x.srt: text/plain; charset=unknown-8bit

In another attempt I found:

find .  -type f -print | xargs file
./x.srt:   Non-ISO extended-ASCII text, with CRLF line terminators

Also I tried enca:

enca x.srt 
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.

and

enca -L Persian  x.srt 
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.

So I am wondering how to know the encoding and eventually convert it to a usable format.

Best Answer

There is no reliable way to convert from an unknown encoding to a known one.

In your case, if you know the original text is in Farsi / Persian, maybe you can identify a number of possible encodings, and iterate over those until you see the output you expect.

A quick search suggests there is no standard, stable converter for the legacy Iran System encoding, which leaves Windows code page 1256 as the only remaining popular candidate. I have included MacArabic here mainly for illustrative purposes (though it might be a feasible candidate for Farsi, too).

# Try each candidate encoding; keep the output only if iconv succeeds.
for encoding in cp1256 macarabic; do
    if iconv -f "$encoding" -t utf-8 inputfile >outputfile."$encoding"; then
        echo "$encoding: possible"
    else
        echo "$encoding: skipped"
        rm -f outputfile."$encoding"
    fi
done

(My version of iconv doesn't actually support MacArabic, but maybe you will have more luck; or you can try a different conversion tool.)
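
Whether a name such as macarabic is available can be checked before running the loop. This is a minimal sketch, assuming an iconv that exits non-zero when it is given an encoding name it does not know (GNU iconv behaves this way). The function name supports_encoding is made up for illustration.

```shell
# supports_encoding NAME (hypothetical helper): succeed if the local iconv
# accepts NAME as a source encoding. Converting empty input is enough here,
# assuming iconv rejects an unknown encoding name before reading any data.
supports_encoding() {
    iconv -f "$1" -t utf-8 </dev/null >/dev/null 2>&1
}

# Report on the same candidates the loop above tries.
for encoding in cp1256 macarabic; do
    if supports_encoding "$encoding"; then
        echo "$encoding: supported"
    else
        echo "$encoding: not supported by this iconv"
    fi
done
```

Running iconv -l should print the full list of names your iconv knows, which is another place to look for candidates.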

Examine the resulting output files; see if one of them seems to make sense.
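
To compare the candidates side by side, it can help to print the first few lines of each output file. This is a small sketch, assuming the outputfile.* names produced by the loop above and a terminal that displays UTF-8. preview_candidates is a hypothetical helper name, and the choice of five lines is arbitrary.

```shell
# preview_candidates FILE... (hypothetical helper): print a header and the
# first five lines of every existing file named on the command line.
preview_candidates() {
    for f in "$@"; do
        [ -f "$f" ] || continue    # skip arguments that are not existing files
        printf '== %s ==\n' "$f"
        head -n 5 "$f"
    done
}

# If no output files exist, the unmatched pattern is skipped silently.
preview_candidates outputfile.*
```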

If you know what the output should look like, you can also look up the mappings of individual bytes in the file. If the first byte is 0x94 and you know it should display as ﭖ, you have basically established that the encoding is Iran System. Look up a few more bytes to verify this conclusion. The Wikipedia page for this encoding has a table of all the characters. Obviously, this is painstaking, slow, and error-prone, especially if there are many candidate encodings to choose from.

For some encodings, you can find a list e.g. at https://tripleee.github.io/8bit/ -- for others, maybe you just have to look at the corresponding Wikipedia coding tables.
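
When looking up bytes by hand, it may be easier to start from a list of the non-ASCII bytes that actually occur in the file. The sketch below is one possible way to get that list with od and standard text tools. high_bytes is a hypothetical helper name, and the sample file is made up purely for demonstration.

```shell
# high_bytes FILE (hypothetical helper): list each distinct byte >= 0x80
# in FILE as two hex digits, preceded by its count, most frequent first.
high_bytes() {
    od -An -v -tx1 "$1" |                   # dump every byte as hex, no offsets
        tr -s ' ' '\n' |                    # one hex byte per line
        grep -E '^[89a-f][0-9a-f]$' |       # keep 0x80..0xff only
        sort | uniq -c | sort -rn
}

# Demonstration on a made-up sample: two 0x94 bytes, an ASCII letter, CRLF.
sample=$(mktemp)
printf '\224\224A\r\n' >"$sample"
high_bytes "$sample"
rm -f "$sample"
```

On the real file, something like high_bytes x.srt | head would show the most frequent bytes, which are usually the most useful ones to compare against a code table.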
