Grep Unicode – Find All Lines Containing Japanese Kanjis

grep unicode

In a huge UTF-8 text file, I want to show all lines that contain Japanese kanjis.
What grep (or other) expression does this?

If I am not mistaken, kanjis are the characters between \u4e00 and \u9fff.

I don't need to show kanas, but showing them too would not be a big problem.

Best Answer

It is impossible (without using a huge table) to tell a Japanese kanji apart from a Han ideograph that is not used in Japanese (e.g., a Chinese or Korean variant).

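If you just want to detect any Han ideograph in the basic range (\u4e00 to \u9fff), note that in UTF-8 each of them is encoded in 3 bytes: the first byte is always between 0xe4 and 0xe9, and the second and third bytes are between 0x80 and 0xbf. A pattern built from these byte ranges is slightly broader than the basic range: it covers \u4000 to \u9fff, which also takes in part of CJK Extension A and the Yijing hexagram symbols.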
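To check these byte ranges yourself, a small shell sketch like the following dumps the UTF-8 bytes of a few sample characters. The helper name utf8_hex is made up for illustration, and the sketch assumes the script itself is saved as UTF-8 and that od and tr are available.

```shell
# utf8_hex is a hypothetical helper: print the bytes of its argument as hex.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

utf8_hex '一'   # U+4E00, first code point of the basic Han range
utf8_hex '漢'   # U+6F22, inside the range
utf8_hex 'あ'   # U+3042, hiragana: first byte is 0xe3, outside 0xe4-0xe9
```

The three calls print e4b880, e6bca2 and e38182, so the two ideographs start with a byte in 0xe4-0xe9 while the hiragana does not.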

There are two difficulties here: first, you have to tell grep to match bytes rather than characters; second, you have to type the 0xe4, 0xe9, 0x80 and 0xbf bytes to put them in the regular expression.

I discovered that the -P switch does both, so the line you want is:

grep -P "[\xe4-\xe9][\x80-\xbf][\x80-\xbf]"

and if you want kana too:

grep -P "[\xe4-\xe9][\x80-\xbf][\x80-\xbf]|\xe3[\x81-\x83][\x80-\xbf]"
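As a quick way to try both commands, here is a sketch that runs them against a throwaway sample file. The function names han_lines and han_kana_lines and the sample text are made up for illustration. The LC_ALL=C prefix and the -a flag are additions, not part of the commands above: in a UTF-8 locale, some grep builds read \xe4 under -P as the character U+00E4 rather than as a byte, and forcing the C locale is one way to get byte matching.

```shell
# Hypothetical wrappers around the two patterns; LC_ALL=C forces byte matching.
han_lines() {
    LC_ALL=C grep -a -P '[\xe4-\xe9][\x80-\xbf][\x80-\xbf]' "$@"
}
han_kana_lines() {
    LC_ALL=C grep -a -P '[\xe4-\xe9][\x80-\xbf][\x80-\xbf]|\xe3[\x81-\x83][\x80-\xbf]' "$@"
}

# Throwaway sample file (contents are illustrative only).
sample=$(mktemp)
printf '%s\n' 'plain ASCII line' '漢字 with kanji' 'かな only' 'café au lait' > "$sample"

han_lines "$sample"        # lines with Han ideographs
han_kana_lines "$sample"   # lines with Han ideographs or kana

rm -f "$sample"
```

With this sample, the first call should print only the kanji line, and the second should print the kanji line and the kana line. The accented é in the last line is a 2-byte sequence (0xc3 0xa9), so neither pattern matches it.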