Shell – How to specify characters using hexadecimal codes in `grep`

character encoding, grep, shell, unicode

I am using the following command to grep for a character range, and I want to write that range with hexadecimal codes: 0900 (instead of अ) to 097F (instead of व). How can I use hexadecimal codes in place of अ and व?

    bzcat archive.bz2 | grep -v '<[अ-व]*\s' | tr '[:punct:][:blank:][:digit:]' '\n' | uniq | grep -o '^[अ-व]*$' | sort -f | uniq -c | sort -nr | head -50000 | awk '{print "<w f=\""$1"\">"$2"</w>"}' > hindi.xml

I get the following output:

    <w f="399651">और</w>
    <w f="264423">एक</w>
    <w f="213707">पर</w>
    <w f="74728">कर</w>
    <w f="44281">तक</w>
    <w f="35125">कई</w>
    <w f="26628">द</w>
    <w f="23981">इन</w>
    <w f="22861">जब</w> 
    ...

I just want to use hexadecimal code instead of अ and व in the above command.

If using hexadecimal codes is not possible at all, can I use Unicode code points instead for the character set ('अ-व')?

I am using Ubuntu 10.04.

Best Answer

See the related question "grep: Find all lines that contain Japanese kanjis".

Text is usually encoded in UTF-8, so you have to use the hex values of the bytes that make up each character's UTF-8 encoding.

    grep "["$'\xe0\xa4\x85'"-"$'\xe0\xa4\xb5'"]"

and

    grep '[अ-व]'

are equivalent, and both perform locale-based matching. That is, the match depends on the collation rules for the Devanagari script: it is NOT "any character between \u0905 and \u0935" but rather "anything sorting between Devanagari A and Devanagari VA", and the two may differ.

($'...' is the "ANSI-C escape string" syntax for bash, ksh, and zsh. It is just an easier way to type the characters. You can also use the \uXXXX and \UXXXXXXXX escapes to directly ask for code points in bash and zsh.)
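To see where those byte values come from, here is a minimal sketch, assuming bash and the standard `od` and `tr` utilities, with the script itself saved as UTF-8. It dumps the UTF-8 bytes of अ and व and checks that the `$'...'` spelling yields the same string as the literal character. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Dump the UTF-8 bytes of each endpoint of the range as a hex string.
first_hex=$(printf '%s' 'अ' | od -An -tx1 | tr -d ' \n')
last_hex=$(printf '%s' 'व' | od -An -tx1 | tr -d ' \n')
echo "U+0905 is encoded as: $first_hex"
echo "U+0935 is encoded as: $last_hex"

# Build the same characters from hex escapes with ANSI-C quoting.
first_char=$'\xe0\xa4\x85'
last_char=$'\xe0\xa4\xb5'

# Compare the escaped spelling with the literal character.
if [ "$first_char" = 'अ' ] && [ "$last_char" = 'व' ]; then
    echo "hex escapes and literals are identical"
fi
```

The same `od` pipeline can be used to look up the bytes for any other character you want to put in a bracket expression.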

On the other hand, you have this (note -P):

    grep -P "\xe0\xa4[\x85-\xb5]"

which matches on the raw byte values rather than on locale collation order.
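The byte-wise variant can be tried on a small sample, as in the sketch below. It assumes GNU grep built with PCRE support. The `LC_ALL=C` prefix is an assumption added here and is not part of the original command: in a UTF-8 locale, `grep -P` may treat `\xe0` as the code point U+00E0 rather than as a single byte, so forcing the C locale keeps the match byte-oriented. The sample data and variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Four sample lines: two inside the byte range (अ = e0 a4 85, व = e0 a4 b5),
# one ASCII word, and one Devanagari letter outside it (ह = e0 a4 b9).
sample=$(printf '%s\n' 'अ' 'व' 'abc' 'ह')

# Count the lines containing the bytes e0 a4 followed by a byte in 85..b5.
# LC_ALL=C (an assumption, see above) makes grep treat the input as bytes.
match_count=$(printf '%s\n' "$sample" | LC_ALL=C grep -c -P '\xe0\xa4[\x85-\xb5]')

echo "matching lines: $match_count"
```

Because this compares raw bytes, it only works for ranges whose endpoints share the same leading bytes, as अ and व do here (`e0 a4`).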
