How to Filter Invalid UTF-8 Characters – Command Line Techniques

character encoding, command line, text processing, unicode

I have a text file in an unknown or mixed encoding. I want to see the lines that contain a byte sequence that is not valid UTF-8 (by piping the text file into some program). Equivalently, I want to filter out the lines that are valid UTF-8. In other words, I'm looking for grep [notutf8].

An ideal solution would be portable, short and generalizable to other encodings, but if you feel the best way is to bake in the definition of UTF-8, go ahead.

Best Answer

If you want to use grep, you can do:

grep -axv '.*' file

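In a UTF-8 locale, this prints the lines that contain at least one invalid UTF-8 sequence (this works with GNU grep at least). Here -a makes grep treat the input as text even if it looks binary, -x requires the pattern to match the whole line, and -v inverts the match. In a UTF-8 locale, . does not match bytes that fail to form a valid character, so .* matches exactly the lines that are entirely valid, and -v keeps the rest.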
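If GNU grep is not available, the same idea can be approximated with iconv, which also generalizes to other encodings. The following is a minimal sketch, not a tested portable solution: invalid_lines is a hypothetical function name, and it assumes an iconv that exits with a non-zero status when its input is invalid in the source encoding. It starts one iconv process per line, so it is slow on large files.

```shell
# Print the lines of standard input that are NOT valid in the given
# encoding (default: UTF-8). "invalid_lines" is a hypothetical helper name.
# Assumption: iconv exits non-zero when the input is invalid.
invalid_lines() {
    enc=${1:-UTF-8}
    # "|| [ -n "$line" ]" also handles a final line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Convert the line to itself; a failure means it is not valid in $enc.
        if ! printf '%s\n' "$line" | iconv -f "$enc" -t "$enc" >/dev/null 2>&1; then
            printf '%s\n' "$line"
        fi
    done
}
```

Usage would be something like invalid_lines < file, or invalid_lines UTF-16 < file for a different encoding. For single-byte encodings, most or all byte values are valid, so the check is mainly useful for multi-byte encodings.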

Related Solutions

How to convert unknown-8bit file to utf8

There is no reliable way to convert from an unknown encoding to a known one.

In your case, if you know the original text is in Farsi / Persian, maybe you can identify a number of possible encodings, and iterate over those until you see the output you expect.

Based on a quick search, there seems to be no standard, stable converter for the legacy Iran System encoding, and the only remaining popular alternative is Windows code page 1256. I have included MacArabic here mainly for illustration (though it might even be a feasible alternative for Farsi).

for encoding in cp1256 macarabic; do
    if iconv -f "$encoding" -t utf-8 inputfile >outputfile."$encoding"; then
        echo "$encoding: possible"
    else
        echo "$encoding: skipped"
        rm outputfile."$encoding"
    fi
done

(My version of iconv doesn't actually support MacArabic, but maybe you will have more luck; or you can try a different conversion tool.)

Examine the resulting output files; see if one of them seems to make sense.

If you know what the output should look like, you can also look up the mappings for individual bytes in the file. If the first byte is 0x94 and you know it should display as ﭖ, you have basically established that the encoding is Iran System. Look up a few more bytes to verify this conclusion. The Wikipedia page for this encoding has a table of all the characters. Obviously, this is painstaking, slow, and error-prone, especially if there are many candidate encodings to choose from.
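To look up individual bytes this way, it helps to dump the start of the file in hexadecimal. A small sketch using the POSIX od utility follows; first_bytes is a hypothetical helper name, and the default of 16 bytes is an arbitrary choice.

```shell
# Show the first N bytes (default 16) of a file as two-digit hex values.
# "first_bytes" is a hypothetical helper name.
first_bytes() {
    file=$1
    count=${2:-16}
    # -An: no offset column, -tx1: one hex byte per item, -N: byte limit
    od -An -tx1 -N "$count" "$file"
}
```

Running first_bytes inputfile 4 prints the first four bytes, which can then be compared against the code table of each candidate encoding.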

For some encodings, you can find such a table at https://tripleee.github.io/8bit/; for others, you may have to look at the code tables on the corresponding Wikipedia page.

Get consistent encoding for all files in directory

I'd refine your script to:

set -o noclobber
for f in ./*.csv
do
  if [ "$(file -b --mime-encoding "$f")" = utf-16le ]; then
    iconv -f UTF-16 -t UTF-8 "$f" > "$f"-new &&
      mv "$f"-new "$f"
  fi
done
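After a conversion like this, it may be worth confirming that every file now validates as UTF-8. A minimal sketch, assuming an iconv that exits with a non-zero status on invalid input; check_utf8_dir is a hypothetical function name, and the *.csv pattern simply mirrors the loop above.

```shell
# Report the .csv files in a directory that are not valid UTF-8.
# "check_utf8_dir" is a hypothetical helper name.
# Assumption: iconv exits non-zero when the input is invalid.
check_utf8_dir() {
    dir=${1:-.}
    status=0
    for f in "$dir"/*.csv; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$f" ] || continue
        if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
            printf '%s: not valid UTF-8\n' "$f"
            status=1
        fi
    done
    # Non-zero if at least one file failed validation.
    return "$status"
}
```

Calling check_utf8_dir . in the directory prints nothing and returns 0 when all files pass.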
