How to interpret character ranges in charmap files

character encoding, locale, posix

The charmap file /usr/share/i18n/charmaps/UTF-8.gz has this line:

<U3400>..<U343F> /xe3/x90/x80 <CJK Ideograph Extension A>

The man page for charmap(5) only says that it means a range. Then I found the spec, but it says that the number in the character name is supposed to be in decimal, not hex, and it uses three dots as opposed to the two in the man page. So, how should I interpret character ranges in charmap files? Especially if I see something like

<U3400>..<U3430> /xe3/x90/x80 <CJK Ideograph Extension A>

then is the range in decimal or hex?

Best Answer

glibc allows three-dot decimal ranges (as in POSIX) and two-dot hexadecimal ranges. This doesn't appear to be documented anywhere, but we can see it in the source code. The two-dot form is not portable behaviour defined by POSIX, but an extension of glibc and possibly other implementations. If you're writing your own files, use the decimal form.


Let's confirm that this is the actual behaviour of glibc.

When processing a range, glibc uses:

   if (decimal_ellipsis)
     while (isdigit (*cp) && cp >= from)
       --cp;
   else
     while (isxdigit (*cp) && cp >= from)
       {
         if (!isdigit (*cp) && !isupper (*cp))
           lr_error (lr, _("\
 hexadecimal range format should use only capital characters"));
         --cp;
       }

Here isxdigit accepts a hexadecimal digit and isdigit a decimal one, so the loop walks backwards over the numeric suffix of the character name. Later, glibc converts the consumed substring to an integer with the same branch on decimal_ellipsis and carries on as you'd expect. The kind of ellipsis was determined earlier, during parsing, from the token returned by the lexer.
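To make the two steps concrete (scan backwards over the digits, then convert in the matching base), here is a minimal standalone sketch. It is not glibc's code: the names is_range_digit and range_suffix_value are hypothetical, the name is assumed to be passed without its angle brackets, and unlike glibc, which reports an error for lowercase hex digits, this sketch simply does not treat them as digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: is c a digit for the given kind of ellipsis?
   Three dots (decimal_ellipsis != 0) accept 0-9 only; two dots also
   accept the capital letters A-F.  */
int
is_range_digit (int c, int decimal_ellipsis)
{
  if (isdigit (c))
    return 1;
  return !decimal_ellipsis && c >= 'A' && c <= 'F';
}

/* Hypothetical helper: store the numeric suffix of a symbolic name such
   as "U3400" in *out.  Returns 0 on success, or -1 if the name does not
   end in at least one digit of the requested kind.  */
int
range_suffix_value (const char *name, int decimal_ellipsis,
                    unsigned long *out)
{
  assert (name != NULL && out != NULL);

  /* Start past the last character and walk backwards over the digits,
     checking the bound before looking at the character.  */
  const char *cp = name + strlen (name);
  while (cp > name
         && is_range_digit ((unsigned char) cp[-1], decimal_ellipsis))
    --cp;

  if (*cp == '\0')
    return -1;                  /* No numeric suffix at all.  */

  /* Convert in the base that matches the ellipsis.  */
  *out = strtoul (cp, NULL, decimal_ellipsis ? 10 : 16);
  return 0;
}
```

With "U3400" the sketch yields 0x3400 for a two-dot range and 3400 for a three-dot range. "U343F" has no decimal suffix, so it is rejected when treated as the endpoint of a three-dot range.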

The UTF-8 charmap file is mechanically generated from unicode.org's UnicodeData.txt, creating 64-codepoint ranges with two dots. I suppose that this convenient auto-generation is at least partially behind the extension, but I don't know. Earlier versions of glibc also generated the file, with a different program but in the same format.

Again, this doesn't appear to be documented anywhere, and since it's auto-generated right next to where it's used it conceivably could change, but I imagine it will be stable.


If given something like

<U3400>..<U3430> /xe3/x90/x80 <CJK Ideograph Extension A>

then it is a hexadecimal range, because it uses two dots. With three dots, it would be a POSIX decimal range.
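The difference between the two readings can be checked with a little arithmetic. The sketch below is illustrative only: range_length and nth_hex_name are hypothetical helpers, not part of glibc, and the sketch assumes that the generated names keep the zero-padded width of the endpoints.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: number of characters named by a range whose
   endpoints carry the numeric suffixes from_digits and to_digits.
   dots is 2 (glibc hexadecimal extension) or 3 (POSIX decimal).
   Returns 0 for an unknown ellipsis or a descending range.  */
unsigned long
range_length (const char *from_digits, const char *to_digits, int dots)
{
  int base;

  if (dots == 2)
    base = 16;
  else if (dots == 3)
    base = 10;
  else
    return 0;

  unsigned long from = strtoul (from_digits, NULL, base);
  unsigned long to = strtoul (to_digits, NULL, base);
  return to >= from ? to - from + 1 : 0;
}

/* Hypothetical helper: write the name of the nth character (counting
   from 0) of a two-dot range into buf, zero-padded to width digits.
   Returns what snprintf returns.  */
int
nth_hex_name (char *buf, size_t size, const char *prefix,
              unsigned long first, unsigned long n, int width)
{
  return snprintf (buf, size, "<%s%0*lX>", prefix, width, first + n);
}
```

Under these assumptions, <U3400>..<U3430> covers 49 characters (0x3400 through 0x3430), whereas the same endpoints read as a three-dot decimal range would cover only 31. The 64-codepoint range <U3400>..<U343F> from the real file can only be written with two dots, since 343F is not a decimal number.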

If you're on another system that doesn't have this extension, it would just be a syntax error. A portable character map file should only use the decimal ranges.