Shell – Searching files containing non-ASCII characters

grepregular expressionshell-scripttext;

I'm trying to find files in a directory that contains some non-ASCII Unicode characters. The exact characters that I have to find are not known to me beforehand.

Conceptually, this should be an easy task — find all files that match the regex [^\0-\x7f]. However, I cannot come up with something that can actually do this.

The closest thing that I could come up with is:

find . -type f -exec grep -Plv '[\0-\x7f]' {} \;

which ends up listing most normal text files due to matches on the blank lines.

The -e switch is not allowed in combination with -P so I cannot use -e '[\0-\x7f]' -e '^$', and converting the regex to [\0-\x7f]|^$ would be obviously wrong because now its an "or".

Is there another way to search for such characters?

Best Answer

With grep -Pv '[\0-\x7f]', you're asking for lines that don't (-v) contain an ASCII character. That's not the same thing as lines that contain a non-ASCII character. Just ask for that.

LC_ALL=C grep -lP '[^\0-\x7f]'

Instead of a code point range, you could ask for non-printable characters in an ASCII locale. This is almost equivalent (it also includes control characers).

LC_ALL=C grep -l '[^[:print:]]'

An equivalent, more complicated way is to search for lines that are entirely made up of ASCII characters and invert the matching.

LC_ALL=C grep -vlP '^[\0-\x7f]*$'
Related Question