How to Filter Invalid UTF-8 Characters – Command Line Techniques

character encoding, command line, text processing, unicode

I have a text file in an unknown or mixed encoding. I want to see the lines that contain a byte sequence that is not valid UTF-8 (by piping the text file into some program). Equivalently, I want to filter out the lines that are valid UTF-8. In other words, I'm looking for grep [notutf8].

An ideal solution would be portable, short and generalizable to other encodings, but if you feel the best way is to bake in the definition of UTF-8, go ahead.

Best Answer

If you want to use grep, you can do:

grep -axv '.*' file

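in a UTF-8 locale to get the lines that contain at least one invalid UTF-8 sequence (this works with GNU grep at least). Here -a makes grep treat the file as text even though it contains bytes that look binary, -x requires the pattern to match the whole line, and -v inverts the match. Since . only matches valid characters in the current locale, '.*' matches exactly the lines that are entirely valid UTF-8, and -v prints the rest.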
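
For the "generalizable to other encodings" part of the question, one possible approach (not part of the grep answer above) is to pass each line through iconv and print the lines it rejects. The following is only a sketch: the function name invalid_lines is made up for illustration, and it assumes your iconv exits non-zero on input it cannot decode (GNU iconv does) and that the encoding uses the byte 0x0A as its newline, which rules out UTF-16 and similar.

```shell
# Hypothetical helper: print the lines of standard input that are not
# valid in the given encoding (default: UTF-8).
invalid_lines() {
    enc=${1:-UTF-8}
    # IFS= and -r keep each line as-is; the || clause also handles a
    # final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Converting an encoding to itself is used purely as a validity
        # check: iconv is assumed to fail on a sequence it cannot decode.
        if ! printf '%s\n' "$line" | iconv -f "$enc" -t "$enc" >/dev/null 2>&1; then
            printf '%s\n' "$line"
        fi
    done
}

# Usage: invalid_lines UTF-8 < file
```

This starts one iconv process per line, so it is far slower than the grep one-liner on large files, and the shell drops any NUL bytes in the input. It may still be useful where GNU grep or a UTF-8 locale is not available.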
