Linux – How to identify non-ASCII characters from the shell

ascii awk grep linux perl

Is there a simple way to print all non-ASCII characters and the line numbers on which they occur in a file using a command line utility such as grep, awk, perl, etc?

I want to change the encoding of a text file from UTF-8 to ASCII, but before doing so I want to manually replace every non-ASCII character, to avoid unexpected character changes made by the file conversion routine.

Best Answer

$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/'  utf8.txt
2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不

$ grep -n -P '[^\x00-\x7F]' utf8.txt
2:Pour être ou ne pas être
4:Byť či nebyť
5:是或不

where utf8.txt is

$ cat utf8.txt
To be or not to be.
Pour être ou ne pas être
Om of niet zijn
Byť či nebyť
是或不

Related Solutions

Grepping a substring from a grep result

I don't know what OS you're on, but on FreeBSD 7.0+ grep has a -o option that returns only the part of each line that matches the pattern. So you could run:
grep "marker-1234" filter_log | grep -oE "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

This returns a list of just the IP addresses from 'filter_log'.

This works on my system, but again, I don't know what your version of grep supports.
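
To make the two-stage pipeline concrete, here is a small sketch run against made-up log lines. The real format of filter_log is not shown in the answer, so the sample contents, the temporary file, and the function name extract_ips are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical helper wrapping the pipeline from the answer.
# The first grep keeps only lines carrying the marker; the second prints
# only the matched dotted-quad substrings, one per output line.
extract_ips() {
    grep "marker-1234" "$1" |
        grep -oE "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
}

# Made-up log contents, standing in for the real filter_log.
log=$(mktemp)
cat > "$log" <<'EOF'
marker-1234 pass in from 192.168.1.10 to 10.0.0.5
marker-9999 block in from 172.16.0.1
marker-1234 pass in from 203.0.113.7
EOF

extract_ips "$log"
rm -f "$log"
```

The pattern only checks the shape of an address (four groups of one to three digits), so it would also accept out-of-range values such as 999.999.999.999.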

Linux – Using find or grep to locate filenames with accented characters from a different encoding system (Windows to Linux)

The GNU tools appear to have code that causes accented letters to be treated like their base letters when matching a regex character class, if supported by the character encoding. This is intended as a "do what I mean" sort of feature to make writing regexes easier, but in this case it's getting in your way.

Try the following modification to your "find" command line:

LANG=C find . -regex '.*[^a-zA-Z./].*'

This sets the LANG environment variable only for the "find" command. Since the "C" locale supports only ASCII, the accented letters will no longer be treated as their base letters, and so will be matched properly by your regex. Note that LC_ALL, if it is set in your environment, takes precedence over LANG; in that case use LC_ALL=C instead.
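
As a related sketch, the same locale trick can be combined with a -name glob instead of -regex, which reports only names containing a byte outside printable ASCII (space through tilde). This is a different pattern from the one in the answer; the function name find_non_ascii_names and the temporary directory are invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: list paths under a directory whose final component
# contains a byte outside printable ASCII (space through tilde).
# LC_ALL=C is used rather than LANG=C because LC_ALL takes precedence
# over every other locale variable.
find_non_ascii_names() {
    LC_ALL=C find "$1" -name '*[! -~]*'
}

# Build a scratch directory with one plain and one accented file name.
# printf '\303\251' emits the two UTF-8 bytes of "é".
dir=$(mktemp -d)
touch "$dir/plain.txt" "$dir/caf$(printf '\303\251').txt"

find_non_ascii_names "$dir"
rm -rf "$dir"
```

Only the accented name should be listed, since -name tests the last path component and the plain file name is entirely printable ASCII.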
