There is no reliable way to convert from an unknown encoding to a known one.
In your case, if you know the original text is in Farsi / Persian, maybe you can identify a number of possible encodings, and iterate over those until you see the output you expect.
Based on some quick googling, there is no standard, stable converter for the legacy Iran System encoding, and the most popular remaining alternative is Windows code page 1256. I have included MacArabic here mainly for illustrative purposes (though perhaps it would even be a feasible alternative for Farsi, too?).
# Try each candidate encoding; keep the output only if iconv succeeds
for encoding in cp1256 macarabic; do
    if iconv -f "$encoding" -t utf-8 inputfile >"outputfile.$encoding"; then
        echo "$encoding: possible"
    else
        echo "$encoding: skipped"
        rm -f "outputfile.$encoding"
    fi
done
(My version of iconv doesn't actually support MacArabic, but maybe you will have more luck; or you can try a different conversion tool.)
Examine the resulting output files; see if one of them seems to make sense.
If you know what the output should look like, you can also look up individual mappings for bytes in the file. If the first byte is 0x94 and you know it should display as ﭖ, you have basically established that the encoding is Iran System. Look up a few more bytes to verify this conclusion; the Wikipedia page for this encoding has a table of all the characters. Obviously, this is painstaking, slow, and error-prone, especially if there are many candidate encodings to choose from.
For some encodings, you can find a list e.g. at https://tripleee.github.io/8bit/ -- for others, maybe you just have to look at the corresponding Wikipedia coding tables.
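To do that kind of lookup, you first need to see the raw byte values. A quick sketch using standard POSIX tools (the file name inputfile matches the loop above):

```shell
# Print the first 16 bytes as two-digit hex values, no offsets,
# so each byte can be looked up in an encoding table by hand.
head -c 16 inputfile | od -An -tx1
```

If the first value printed is 94 and the text should start with ﭖ, that points at Iran System, as described above.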
I'd refine your script to:
set -o noclobber
for f in ./*.csv
do
    if [ "$(file -b --mime-encoding "$f")" = utf-16le ]; then
        iconv -f UTF-16 -t UTF-8 "$f" >"$f"-new &&
        mv "$f"-new "$f"
    fi
done
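One caveat: glibc iconv's generic UTF-16 converter assumes big-endian input when there is no BOM, so BOM-less files that file reports as utf-16le can come out garbled. A hedged variant that names the detected byte order explicitly (assuming your iconv accepts the UTF-16LE name):

```shell
set -o noclobber
for f in ./*.csv
do
    # Ask file for the exact encoding and pass it to iconv explicitly,
    # so BOM-less little-endian files are read with the right byte order.
    case $(file -b --mime-encoding "$f") in
        utf-16le)
            iconv -f UTF-16LE -t UTF-8 "$f" >"$f"-new &&
            mv "$f"-new "$f" ;;
    esac
done
```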
If you want to use grep, you can do:
grep -axv '.*' inputfile
in UTF-8 locales to get the lines that contain at least one invalid UTF-8 sequence (this works with GNU grep at least).
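If your grep doesn't support those options, iconv itself can serve as a validator: converting from UTF-8 to UTF-8 fails with a nonzero exit status at the first invalid sequence, independently of your locale.

```shell
# The exit status reveals whether the whole file is valid UTF-8;
# the converted output itself is discarded.
if iconv -f UTF-8 -t UTF-8 inputfile >/dev/null 2>&1; then
    echo "inputfile: valid UTF-8"
else
    echo "inputfile: contains invalid sequences"
fi
```

This only gives you a yes/no answer for the whole file, whereas the grep approach shows you the offending lines.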