Shell – How to specify characters using hexadecimal codes in `grep`


I am using the following command to grep for a character range, and I would like to specify it with the hexadecimal codes 0900 (instead of अ) and 097F (instead of व). How can I use hexadecimal codes in place of अ and व?

bzcat archive.bz2 | grep -v '<[अ-व]*\s' | tr '[:punct:][:blank:][:digit:]' '\n' | uniq | grep -o '^[अ-व]*$' | sort -f | uniq -c | sort -nr | head -50000 | awk '{print "<w f=\""$1"\">"$2"</w>"}' > hindi.xml

I get the following output:

    <w f="399651">और</w>
    <w f="264423">एक</w>
    <w f="213707">पर</w>
    <w f="74728">कर</w>
    <w f="44281">तक</w>
    <w f="35125">कई</w>
    <w f="26628">द</w>
    <w f="23981">इन</w>
    <w f="22861">जब</w> 
    ...

I just want to use hexadecimal codes instead of अ and व in the above command.

If using hexadecimal codes is not possible at all, can I use Unicode code points instead for the character set ('अ-व')?

I am using Ubuntu 10.04.

Best Answer

Look at grep: Find all lines that contain Japanese kanjis.

Text is usually encoded in UTF-8, so you have to use the hex values of the bytes used in the UTF-8 encoding.

grep "["$'\xe0\xa4\x85'"-"$'\xe0\xa4\xb5'"]"

and

grep '[अ-व]'

are equivalent, and they perform locale-based matching. That is, matching depends on the sorting rules of the Devanagari script: it is NOT "any character between \u0905 and \u0935" but "anything sorting between Devanagari A and Devanagari VA", and the two may differ.

($'...' is the "ANSI-C escape string" syntax for bash, ksh, and zsh. It is just an easier way to type the characters. You can also use the \uXXXX and \UXXXXXXXX escapes to directly ask for code points in bash and zsh.)
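
As a small illustration of the ANSI-C quoting described above, the sketch below stores the two range endpoints as UTF-8 bytes and uses them in a bracket expression. It assumes bash (or another shell with `$'...'`); the variable names `start` and `end` are made up for this example.

```shell
#!/usr/bin/env bash
# Range endpoints written as UTF-8 byte values with ANSI-C quoting.
start=$'\xe0\xa4\x85'   # U+0905 DEVANAGARI LETTER A (अ)
end=$'\xe0\xa4\xb5'     # U+0935 DEVANAGARI LETTER VA (व)

# Dump the bytes of an endpoint to confirm it is the expected encoding.
printf '%s' "$start" | od -An -tx1

# Use the endpoints inside a bracket expression; only the matching line is printed.
printf '%s\n' 'और' 'abc' | grep "[${start}-${end}]"
```

Keeping the endpoints in variables keeps the quoting in the `grep` call readable. How the range itself is interpreted still depends on the current locale, as noted above.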

On the other hand, you have this (note -P):

grep -P "\xe0\xa4[\x85-\xb5]"

which does binary matching on those raw byte values.
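
A hedged sketch of the byte-wise variant follows. It assumes GNU grep built with PCRE support (`-P`), and the function name `match_devanagari_bytes` is hypothetical. The C locale is forced because, in a UTF-8 locale, grep may run PCRE in UTF mode, where `\xe0` would be read as a code point rather than a byte.

```shell
#!/usr/bin/env bash
# Match the UTF-8 byte sequences E0 A4 85 .. E0 A4 B5 (U+0905 .. U+0935).
# LC_ALL=C makes grep -P treat both the pattern and the input as raw bytes.
# Extra grep options (for example -c or -o) are passed through via "$@".
match_devanagari_bytes() {
    LC_ALL=C grep -P "$@" '\xe0\xa4[\x85-\xb5]'
}

# Lines containing at least one character in the range are printed.
printf '%s\n' 'और' 'hello' 'तक' | match_devanagari_bytes
```

In a UTF-8 locale, PCRE's `\x{0905}` code point syntax may be a closer fit to what the question asks for, but that depends on how your grep was built, so test it on your system first.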

Debugging convert

You can add the -debug annotate switch to see what convert's up to.

Example

$ printf "\u2318" | convert -size 100x100 label:@- -font unifont-Medium -pointsize 40 -debug annotate command.png
2014-04-10T09:37:04-04:00 0:00.020 0.010u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 12
2014-04-10T09:37:04-04:00 0:00.020 0.010u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 6.375; height: 14; ascent: 11; descent: -3; max advance: 24; bounds: 0.625,0  6.78125,8; origin: 7,0; pixels per em: 12,12; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.020 0.010u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 12
2014-04-10T09:37:04-04:00 0:00.020 0.010u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 6.375; height: 14; ascent: 11; descent: -3; max advance: 24; bounds: 0.625,0  6.78125,8; origin: 7,0; pixels per em: 12,12; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.020 0.010u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 24
2014-04-10T09:37:04-04:00 0:00.020 0.010u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 12.75; height: 28; ascent: 22; descent: -5; max advance: 48; bounds: 1.25,0  13.5781,18; origin: 15,0; pixels per em: 24,24; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.020 0.010u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 48
2014-04-10T09:37:04-04:00 0:00.020 0.010u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 25.5156; height: 55; ascent: 43; descent: -10; max advance: 96; bounds: 2.48438,0  27.1406,35.5; origin: 29,0; pixels per em: 48,48; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.020 0.010u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 96
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 53.0312; height: 110; ascent: 87; descent: -20; max advance: 192; bounds: 4.96875,0  54.2812,70; origin: 59,0; pixels per em: 96,96; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 96
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 53.0312; height: 110; ascent: 87; descent: -20; max advance: 192; bounds: 4.96875,0  54.2812,70; origin: 59,0; pixels per em: 96,96; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 95
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 52.0781; height: 109; ascent: 86; descent: -20; max advance: 190; bounds: 4.92188,0  53.7188,70; origin: 58,0; pixels per em: 95,95; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 94
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 51.1406; height: 108; ascent: 85; descent: -20; max advance: 188; bounds: 4.85938,0  53.1562,68; origin: 57,0; pixels per em: 94,94; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 93
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 51.1875; height: 107; ascent: 84; descent: -20; max advance: 186; bounds: 4.8125,0  52.5781,67; origin: 57,0; pixels per em: 93,93; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 92
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 50.2344; height: 106; ascent: 83; descent: -19; max advance: 184; bounds: 4.76562,0  52.0156,67; origin: 56,0; pixels per em: 92,92; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.020 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 91
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 50.2969; height: 105; ascent: 82; descent: -19; max advance: 182; bounds: 4.70312,0  51.4531,65.4688; origin: 56,0; pixels per em: 91,91; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 90
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 49.3438; height: 103; ascent: 81; descent: -19; max advance: 180; bounds: 4.65625,0  50.8906,65.4688; origin: 55,0; pixels per em: 90,90; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 89
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 48.3906; height: 102; ascent: 81; descent: -19; max advance: 178; bounds: 4.60938,0  50.3281,64; origin: 54,0; pixels per em: 89,89; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 88
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 48.4375; height: 101; ascent: 80; descent: -19; max advance: 176; bounds: 4.5625,0  49.7656,64; origin: 54,0; pixels per em: 88,88; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 87
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 47.5; height: 100; ascent: 79; descent: -18; max advance: 174; bounds: 4.5,0  49.1875,63; origin: 53,0; pixels per em: 87,87; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 86
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 47.5469; height: 99; ascent: 78; descent: -18; max advance: 172; bounds: 4.45312,0  48.625,63; origin: 53,0; pixels per em: 86,86; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 86
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/GetTypeMetrics/798/Annotate
  Metrics: text: ⌘; width: 47.5469; height: 99; ascent: 78; descent: -18; max advance: 172; bounds: 4.45312,0  48.625,63; origin: 53,0; pixels per em: 86,86; underline position: -5.0625; underline thickness: 3.35938
2014-04-10T09:37:04-04:00 0:00.030 0.020u 6.7.8 Annotate convert[16619]: annotate.c/RenderFreetype/1220/Annotate
  Font /usr/share/fonts/msttcorefonts/arialbd.ttf; font-encoding none; text-encoding none; pointsize 86

UPDATE #1 - Debugging further

This issue was irking me, and I think I've finally figured it out. The problem is the selected font, which is unable to display that particular glyph.

First off you can use this command to see which fonts you have available within convert. So let's start there.

$ convert -list font | head -8

Path: /etc/ImageMagick/type-ghostscript.xml
  Font: AvantGarde-Book
    family: AvantGarde
    style: Normal
    stretch: Normal
    weight: 400
    glyphs: /usr/share/fonts/default/Type1/a010013l.pfb
    ...

The above is just a sample; every font has a similar set of lines. Incidentally, running this command shows we have several hundred fonts:

$ convert -list font | grep Font | wc -l
262

Next we're going to render our character, \u2318, using every font we have. This sounds complicated but is fairly simple with a well-thought-out Bash one-liner.
```
# Render ⌘ once per installed font, writing <font-name>.gif for each.
$ for i in $(convert -list font | grep Font | awk '{print $2}'); \
    do convert -font "$i" -pointsize 36 label:⌘ "${i}.gif"; done
```
This snippet will use a for loop to run through each font, running a modified version of your convert command.
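
The same loop can be sketched as a small script that is a bit stricter about parsing the font list. This is only an illustration: the function names `font_names` and `render_with_all_fonts`, and the output directory used in the usage line, are made up here, and it assumes `convert -list font` prints `Font: <name>` lines as in the sample above.

```shell
#!/usr/bin/env bash
# Print only the font names from `convert -list font` output read on stdin.
# Comparing the first field exactly skips lines that merely contain "Font".
font_names() {
    awk '$1 == "Font:" { print $2 }'
}

# Render the given text once per font into the given directory.
# Fonts for which convert exits with an error are reported on stderr.
render_with_all_fonts() {
    local text=$1 outdir=$2
    mkdir -p "$outdir"
    convert -list font | font_names | while IFS= read -r font; do
        convert -font "$font" -pointsize 36 "label:$text" "$outdir/$font.gif" ||
            echo "convert failed for font: $font" >&2
    done
}
```

Calling `render_with_all_fonts '⌘' glyph-samples` would then write one image per font into `glyph-samples/`. A font that lacks the glyph usually still produces an image (with a placeholder box), so the results still need to be checked by eye.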
Now we look through the results. Many of the fonts could not display this particular glyph, but several could, which suggests that this is not a bug in ImageMagick but a limitation of the fonts themselves. Here are the fonts on my system that could display this glyph:
- DejaVu-Sans-Bold
- DejaVu-Sans-Bold-Oblique
- DejaVu-Sans-Book
- DejaVu-Sans-Condensed-Bold
- DejaVu-Sans-Condensed-Bold-Oblique
- DejaVu-Sans-Condensed
- DejaVu-Sans-Condensed-Oblique
- DejaVu-Sans-Mono-Bold
- DejaVu-Sans-Mono-Bold-Oblique
- DejaVu-Sans-Mono-Book
- DejaVu-Sans-Mono-Oblique
- DejaVu-Sans-Oblique
- DejaVu-Serif-Bold
- DejaVu-Serif-Bold-Italic
- DejaVu-Serif-Book
- DejaVu-Serif-Condensed-Bold
- DejaVu-Serif-Condensed-Bold-Italic
- DejaVu-Serif-Condensed
- DejaVu-Serif-Condensed-Italic
- DejaVu-Serif-Italic
- FreeMono-Regular
- FreeSerif-Regular
- STIX-Math-Regular
- STIX-Regular
- VL-Gothic-regular
I visually went through all ~260 resulting .gif files to determine which worked and which didn't.
