What could be a way to retrieve a list of all the characters in a given character class (like blank, alpha, digit…) in the current locale?
For instance,
LC_ALL=en_GB.UTF-8 that-command blank
ideally, on my Debian system, would display something like:
09 U+0009 HORIZONTAL TAB
20 U+0020 SPACE
e1 9a 80 U+1680 OGHAM SPACE MARK
e1 a0 8e U+180E MONGOLIAN VOWEL SEPARATOR
e2 80 80 U+2000 EN QUAD
e2 80 81 U+2001 EM QUAD
e2 80 82 U+2002 EN SPACE
e2 80 83 U+2003 EM SPACE
e2 80 84 U+2004 THREE-PER-EM SPACE
e2 80 85 U+2005 FOUR-PER-EM SPACE
e2 80 86 U+2006 SIX-PER-EM SPACE
e2 80 88 U+2008 PUNCTUATION SPACE
e2 80 89 U+2009 THIN SPACE
e2 80 8a U+200A HAIR SPACE
e2 81 9f U+205F MEDIUM MATHEMATICAL SPACE
e3 80 80 U+3000 IDEOGRAPHIC SPACE
And in the C locale it could display something like:
09 U+0009 HORIZONTAL TAB
20 U+0020 SPACE
That is, the representation of the character in the locale in terms of arrays of bytes (like UTF-8 in the first example, and single byte in the second), the equivalent Unicode character codepoint and a description.
Context
(edit) Now that the vulnerability has long been patched and disclosed, I can add a bit of context.
I asked that question at the time I was investigating CVE-2014-0475. glibc had a bug in that it let the user use locales like LC_ALL=../../../../tmp/evil-locale that are resolved relative to the standard system locale search path, and thus allowed any file to be used as a locale definition.
I could create a rogue locale, for instance with a single-byte-per-character charset where most characters except s, h and a few others were considered blanks, and that would make bash run sh while parsing a typical Debian /etc/bash.bashrc file (and that could be used to get shell access on a git hosting server, for instance, provided bash is used as the login shell of the git server user, the ssh server accepts LC_*/LANG variables, and the attacker can upload files to the server).
Now, if I ever found an LC_CTYPE file (a compiled locale definition) in /tmp/evil, how would I find out it was a rogue one, and in which way?
So my goal is to un-compile those locale definitions, or failing that, at least to know which characters (along with their encoding) are in a given character class.
So with that in mind:
- Solutions that look at the source files for the locale (the locale definitions like the ones in /usr/share/i18n/locales on Debian) are of no use in my case.
- Unicode character properties are irrelevant. I only care about what the locale says. On a Debian system, even between two UTF-8 system locales, let alone rogue ones, the list of characters in a class can be different.
- Tools like recode, python or perl that do the byte/multi-byte to/from character conversion can't be used, as they may (and in practice do) do the conversion in a different way than the locale.
Best Answer
POSSIBLE FINAL SOLUTION
So I've taken all of the below information and come up with this:
NOTE: I use od as the final filter above for preference, and because I know I won't be working with multi-byte characters, which it will not correctly handle. recode u2..dump will both generate output more like that specified in the question and handle wide characters correctly.
OUTPUT
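By way of illustration only - not necessarily the exact command - a pipeline of that general shape, assuming a GNU userland (seq, grep, od) and sticking to the single-byte ASCII range as noted, might look like:

# emit bytes 1-127, keep only the ones the current locale puts in the
# [:blank:] class, then hex-dump the survivors with od as the final filter;
# -a guards against grep treating the stream as binary, and tr strips the
# newlines that grep -o inserts between matches
seq 1 127 |
  while read b; do printf "\\$(printf '%03o' "$b")"; done |
  grep -ao '[[:blank:]]' | tr -d '\n' |
  od -An -tx1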
PROGRAMMER'S API
As I demonstrate below, recode will provide you with your complete character map. According to its manual, it does this based first on the current value of the DEFAULT_CHARSET environment variable or, failing that, it operates exactly as you specify:
Also worth noting about recode is that it is an API:
#include <recode.h>
For internationally-friendly string comparison, the POSIX and C standards define the strcoll() function:
Here is a separately located example of its usage:
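That example is located elsewhere, but the locale dependence of the comparison is easy to see from the shell: on a typical glibc system, GNU sort collates with the locale's rules (strcoll()-style) unless the locale is C. A small illustration, with en_GB.UTF-8 purely as an example locale:

# byte-order collation in the C locale: B D a c
printf '%s\n' a B c D | LC_ALL=C sort
# dictionary-style collation in a typical UTF-8 locale: a B c D
printf '%s\n' a B c D | LC_ALL=en_GB.UTF-8 sort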
Regarding the POSIX character classes, you've already noted you used the C API to find these. For Unicode characters and classes you can use recode's dump-with-names charset to get the desired output. From its manual again:
Using similar syntax to the above, combined with its included test dataset, I can get my own character map with:
OUTPUT
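Not the author's exact invocation, but a minimal use of that charset looks something like this (the input here is just a tab, a space and UTF-8-encoded U+2002 EN SPACE):

# recode reads stdin and prints one line per character in its
# dump-with-names charset: UCS-2 value, RFC 1345 mnemonic and name
printf '\t \342\200\202\n' | recode UTF-8..dump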
But for common characters, recode is apparently not necessary. This should give you named chars for everything in a 128-character charset:
OUTPUT
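One plausible shape for that - a sketch rather than the original command - relies on od's named-character output type, which already knows the ASCII names:

# emit bytes 0-127 and let od print them as named ASCII characters
seq 0 127 |
  while read b; do printf "\\$(printf '%03o' "$b")"; done |
  od -An -ta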
Of course, only 128 bytes are represented, but that's because my locale, UTF-8 charmaps or not, uses the ASCII charset and nothing more, so that's all I get. If I ran it without luit filtering it, though, od would roll it back around and print the same map again up to \0400.
There are two major problems with the above method, though. First there is the system's collation order - for non-ASCII locales the byte values for the charsets are not simply in sequence, which, as I think, is likely the core of the problem you're trying to solve.
Well, GNU tr's man page states that it will expand the [:upper:] and [:lower:] classes in order - but that's not a lot.
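That much is easy to check, at least for the classic case-mapping use (GNU tr shown; both classes expand in order, so each upper-case letter maps to the lower-case letter at the same position):

echo 'LOCALE DEFINITIONS' | tr '[:upper:]' '[:lower:]'
# -> locale definitions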
I imagine some heavy-handed solution could be implemented with sort, but that would be a rather unwieldy tool for a backend programming API.
recode will do this thing correctly, but you didn't seem too in love with the program the other day. Maybe today's edits will cast a more friendly light on it, or maybe not.
GNU also offers the gettext function library, and it seems to be able to address this problem at least for the LC_MESSAGES context:
You might also use native Unicode character categories, which are language independent and forgo the POSIX classes altogether, or perhaps call on the former to provide you enough information to define the latter.
The same website that provided the above information also discusses Tcl's own POSIX-compliant regex implementation, which might be yet another way to achieve your goal.
And last among solutions, I will suggest that you can interrogate the LC_COLLATE file itself for the complete and in-order system character map. This may not seem easily done, but I achieved some success with the following after compiling it with localedef as demonstrated below:
It is, admittedly, currently flawed, but I hope it demonstrates the possibility at least.
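Not the flawed command itself, just the general idea: once a locale has been compiled into a directory with localedef (the step sketched near the end of this answer, using ./mylocale as a placeholder name), the binary LC_COLLATE table can at least be dumped byte-wise:

# peek at the start of the compiled, binary collation table
od -An -c ./mylocale/LC_COLLATE | head -n 20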
AT FIRST BLUSH
It really didn't look like much, but then I started noticing copy commands throughout the list. The above file seems to copy in "en_US", for instance, and another real big one that it seems they all share to some degree is iso_14651_t1_common. It's pretty big:
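For a rough sense of that size (a glob is used here since the exact filename can vary between glibc releases):

# line and byte count of the shared collation-table source on Debian
wc -lc /usr/share/i18n/locales/*14651*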
Here is the intro to /usr/share/i18n/locales/POSIX:
...
You can grep through this of course, but you might just:
Instead. You'd get something like this:
... AND MORE
There is also the luit terminal UTF-8 pty translation device, which I guess acts as a go-between for XTerms without UTF-8 support. It handles a lot of switches - such as logging all converted bytes to a file, or -c as a simple |pipe filter.
I never realized there was so much to this - the locales and character maps and all of that. This is apparently a very big deal, but I guess it all goes on behind the scenes. There are - at least on my system - a couple hundred man 3 related results for locale-related searches.
And also there is:
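Presumably a search of roughly this shape (apropos being one way to run it):

# count, then list, every manual page whose name or description mentions "locale"
apropos locale | wc -l
apropos locale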
That will go on for a very long while.
The Xlib functions handle this all of the time - luit is a part of that package.
The Tcl_uni... functions might prove useful as well.
Just a little <tab> completion and man searches and I've learned quite a lot on this subject.
With localedef you can compile the locales in your I18N directory. The output is funky, and not extraordinarily useful - not like the charmaps at all - but you can get the raw format just as you specify above, like I did:
Then with od you can read it - bytes and strings:
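A sketch of those two steps with example inputs only - en_GB and UTF-8 are arbitrary choices from /usr/share/i18n, and ./mylocale is just a scratch output directory:

# compile the en_GB source definition against the UTF-8 charmap;
# localedef often prints warnings even when the output is perfectly usable
mkdir -p ./mylocale
localedef -f UTF-8 -i en_GB ./mylocale

# then read the compiled, binary categories with od - bytes and characters
od -An -tx1z ./mylocale/LC_CTYPE   | head
od -An -c    ./mylocale/LC_COLLATE | head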
Though it is a long way off from winning a beauty contest, that is usable output. And od is as configurable as you want it to be as well, of course.
I guess I also forgot about these:
I probably forgot about them because I couldn't get them to work. I never use Perl and I don't know how to load a module properly, I guess. But the man pages look pretty nice. In any case, something tells me you'll find calling a Perl module at least a little less difficult than I did. And, again, these were already on my computer - and I never even use Perl. There are also notably a few I18N ones that I wistfully scrolled by, knowing full well I wouldn't get them to work either.