Grep Unicode – Find All Lines Containing Japanese Kanjis

grep unicode

In a huge UTF-8 text file, I want to show all lines that contain Japanese kanji.
What grep (or other) expression does this?

If I am not mistaken, kanji are the characters between \u4e00 and \u9fff.

I don't need to show kanas, but showing them too would not be a big problem.

Best Answer

It is impossible (without using a huge table) to tell a Japanese kanji apart from a Han ideograph not used in Japanese (e.g., a Chinese or Korean variant).

If you just want to detect any Han ideograph in the basic range (\u4e00 to \u9fff), note that in UTF-8 each of them is encoded in 3 bytes: the first byte is always between 0xe4 and 0xe9, and the second and third bytes are between 0x80 and 0xbf.
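To check those byte ranges yourself, here is a minimal sketch that dumps the UTF-8 bytes of the first and last code points of the range. The helper name utf8_hex is made up for this illustration, and the characters are written as octal escapes only to keep the script ASCII-safe.

```shell
# Hypothetical helper: print the bytes read from stdin as space-separated hex.
# The command substitution is left unquoted on purpose so that the shell
# collapses od's padding into single spaces.
utf8_hex() {
    echo $(od -An -v -tx1)
}

# U+4E00 (first code point of the basic Han range), as octal escapes
printf '\344\270\200' | utf8_hex    # prints: e4 b8 80

# U+9FFF (last code point of the basic Han range), as octal escapes
printf '\351\277\277' | utf8_hex    # prints: e9 bf bf
```

Both results fall inside the first-byte range 0xe4 to 0xe9 and the continuation-byte range 0x80 to 0xbf described above.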

There are two difficulties here. First, you have to tell grep that you want to match bytes and not characters. Second, you have to type the 0xe4, 0xe9, 0x80 and 0xbf bytes in order to put them in the regular expression.

I discovered that the -P switch does both, so the line you want is:

grep -P "[\xe4-\xe9][\x80-\xbf][\x80-\xbf]"

and if you want kana too:

grep -P "[\xe4-\xe9][\x80-\xbf][\x80-\xbf]|\xe3[\x81-\x83][\x80-\xbf]"
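A quick way to try both patterns on a scratch file is sketched below. It makes a few assumptions that are not in the answer above: GNU grep built with PCRE support, LC_ALL=C so that \xNN is matched as a byte (in a UTF-8 locale some grep versions may treat it as a code point instead), and -a so the file is never treated as binary. The function names and the sample lines are made up for the illustration.

```shell
# Build a small sample file; non-ASCII text is written as octal escapes.
sample=$(mktemp)
printf 'plain ascii\n'                  >  "$sample"
printf '\346\274\242\345\255\227 han\n' >> "$sample"   # two Han ideographs (U+6F22 U+5B57)
printf '\343\201\213\343\201\252 kana\n' >> "$sample"  # two hiragana (U+304B U+306A)
printf 'caf\303\251 latin\n'            >> "$sample"   # Latin text with e-acute (U+00E9)

# Hypothetical wrappers around the two patterns from the answer.
# LC_ALL=C and -a are additions, see the assumptions stated above.
han_lines() {
    LC_ALL=C grep -a -P '[\xe4-\xe9][\x80-\xbf][\x80-\xbf]' "$1"
}
han_or_kana_lines() {
    LC_ALL=C grep -a -P '[\xe4-\xe9][\x80-\xbf][\x80-\xbf]|\xe3[\x81-\x83][\x80-\xbf]' "$1"
}

han_lines "$sample"          # should print only the line ending in "han"
han_or_kana_lines "$sample"  # should print the "han" and "kana" lines
```

The accented Latin line is not matched because its lead byte (0xc3) is outside both ranges.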

Related Solutions

Shell – How to find text in files and only keep the respective matching lines using the terminal on OS X

grep only finds lines matching a pattern in a file; it does not change the file. You could use sed to find the pattern and make changes to the file:

sed '/\B\/foobar\b/!d' filename

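This displays only the lines matching /foobar in the file. To save the changes to the file in place, use the -i option. Note that the BSD sed shipped with OS X expects a backup-suffix argument after -i (for example -i ''), whereas the form shown here is the GNU sed one.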

sed -i '/\B\/foobar\b/!d' filename

You could use it with find too:

find . -type f -exec sed -i'' '/\B\/foobar\b/!d' {} \;
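Before running an in-place edit on real files, it may be worth trying the expression on a scratch copy. The sketch below assumes GNU sed (the \B and \b escapes are GNU extensions) and uses made-up sample lines; the attached backup suffix -i.bak is chosen here because both GNU and BSD sed accept that form.

```shell
# Scratch directory and a hypothetical sample file.
work=$(mktemp -d)
printf '%s\n' '/foobar here' 'a/foobar' '/foobarbaz' 'nothing' > "$work/demo.txt"

# Delete every line that does NOT match; the original is kept as demo.txt.bak.
sed -i.bak '/\B\/foobar\b/!d' "$work/demo.txt"

cat "$work/demo.txt"        # only "/foobar here" should remain
cat "$work/demo.txt.bak"    # the untouched original, four lines
```

The line a/foobar is dropped because \B refuses a word boundary before the slash, and /foobarbaz is dropped because \b requires a word boundary after foobar.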

How does grep decide that a file is binary

It appears to be the presence of the null character in the file (usually displayed as ^@). I entered various control characters into a text file (delete, ^?, for example), and only the null character caused grep to consider it binary. This was only tested with grep; the less and diff commands, for instance, may use different methods. Control characters in general don't appear except in binaries. The exceptions are the whitespace characters: newline (^J), tab (^I), formfeed (^L), vertical tab (^K), and return (^M).

However, foreign characters, like Arabic or Chinese letters, are not standard ASCII, and perhaps could be confused with control characters. Perhaps that's why it's only the null character.

You can test it out for yourself by inserting control characters into a text file using the text editor vim. Just go to insert mode, press control-v, and then the control character.
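The same experiment can be sketched without an editor by writing the control characters with printf. This assumes GNU grep; the wording of its binary-file notice and the stream it is printed on differ between versions, so the hypothetical helper below merges stderr into stdout and looks for the phrase case-insensitively. LC_ALL=C is an added assumption so that the notice is not translated.

```shell
# One file with an ordinary control character (^A), one with a null byte (^@).
ctrl_file=$(mktemp)
nul_file=$(mktemp)
printf 'hello\001world\n' > "$ctrl_file"
printf 'hello\000world\n' > "$nul_file"

# Hypothetical helper: succeeds if grep reports the file as binary.
grep_says_binary() {
    LC_ALL=C grep . "$1" 2>&1 | grep -qi 'binary file'
}

if grep_says_binary "$ctrl_file"; then echo "ctrl: binary"; else echo "ctrl: text"; fi
if grep_says_binary "$nul_file";  then echo "nul: binary";  else echo "nul: text";  fi
```

If grep is to print the matching lines anyway, the -a (--text) option tells it to treat the file as text.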
