What could be a way to retrieve a list of all the characters in a given character class (like blank, alpha, digit…) in the current locale?
For instance,
LC_ALL=en_GB.UTF-8 that-command blank
ideally, on my Debian system, would display something like:
09 U+0009 HORIZONTAL TAB
20 U+0020 SPACE
e1 9a 80 U+1680 OGHAM SPACE MARK
e1 a0 8e U+180E MONGOLIAN VOWEL SEPARATOR
e2 80 80 U+2000 EN QUAD
e2 80 81 U+2001 EM QUAD
e2 80 82 U+2002 EN SPACE
e2 80 83 U+2003 EM SPACE
e2 80 84 U+2004 THREE-PER-EM SPACE
e2 80 85 U+2005 FOUR-PER-EM SPACE
e2 80 86 U+2006 SIX-PER-EM SPACE
e2 80 88 U+2008 PUNCTUATION SPACE
e2 80 89 U+2009 THIN SPACE
e2 80 8a U+200A HAIR SPACE
e2 81 9f U+205F MEDIUM MATHEMATICAL SPACE
e3 80 80 U+3000 IDEOGRAPHIC SPACE
And in the C locale it could display something like:
09 U+0009 HORIZONTAL TAB
20 U+0020 SPACE
That is, the representation of the character in the locale in terms of arrays of bytes (like UTF-8 in the first example, and single byte in the second), the equivalent Unicode character codepoint and a description.
Context
(edit) Now that the vulnerability has long been patched and disclosed, I can add a bit of context.
I asked that question at the time I was investigating CVE-2014-0475. glibc had a bug in that it let the user use locales like LC_ALL=../../../../tmp/evil-locale that are resolved relative to the standard system locale search path, and thus allowed any file to be used as a locale definition.
I could create a rogue locale, for instance with a single-byte-per-character charset where most characters except s, h and a few others were considered blanks, and that would make bash run sh while parsing a typical Debian /etc/bash.bashrc file (and that could be used to get shell access on a git hosting server, for instance, provided bash is used as the login shell of the git server user, the ssh server accepts LC_*/LANG variables, and the attacker can upload files to the server).
Now, if I ever found an LC_CTYPE file (a compiled locale definition) in /tmp/evil, how would I find out it was a rogue one, and in which way?
So my goal is to un-compile those locale definitions, or failing that, at least to know which characters (along with their encoding) are in a given character class.
So with that in mind:
- Solutions that look at the source files for the locale (the locale definitions like the ones in /usr/share/i18n/locales on Debian) are of no use in my case.
- Unicode character properties are irrelevant. I only care about what the locale says. On a Debian system, even between two UTF-8 system locales, let alone rogue ones, the list of characters in a class can be different.
- Tools like recode, python or perl that do the byte/multi-byte to/from character conversion can't be used, as they may (and in practice do) do the conversion in a different way than the locale.
Best Answer
POSSIBLE FINAL SOLUTION
So I've taken all of the below information and come up with this:
NOTE: I use od as the final filter above for preference, and because I know I won't be working with multi-byte characters, which it will not correctly handle. recode u2..dump will both generate output more like that specified in the question and handle wide characters correctly.
OUTPUT
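By way of illustration only - not necessarily the exact command - a pipeline of that general shape, assuming a GNU userland (seq, grep, od) and sticking to the single-byte ASCII range as noted, might look like:

# emit bytes 1-127, keep only the ones the current locale puts in the
# [:blank:] class, then hex-dump the survivors with od as the final filter;
# -a guards against grep treating the stream as binary, and tr strips the
# newlines that grep -o inserts between matches
seq 1 127 |
  while read b; do printf "\\$(printf '%03o' "$b")"; done |
  grep -ao '[[:blank:]]' | tr -d '\n' |
  od -An -tx1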
PROGRAMMER'S API
As I demonstrate below, recode will provide you with your complete character map. According to its manual, it does this based first on the current value of the DEFAULT_CHARSET environment variable or, failing that, it operates exactly as you specify:
Also worth noting about recode is that it is an API:
#include <recode.h>
For internationally-friendly string comparison, the POSIX and C standards define the strcoll() function:
Here is a separately located example of its usage:
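That example is located elsewhere, but the locale dependence of the comparison is easy to see from the shell: on a typical glibc system, GNU sort collates with the locale's rules (strcoll()-style) unless the locale is C. A small illustration, with en_GB.UTF-8 purely as an example locale:

# byte-order collation in the C locale: B D a c
printf '%s\n' a B c D | LC_ALL=C sort
# dictionary-style collation in a typical UTF-8 locale: a B c D
printf '%s\n' a B c D | LC_ALL=en_GB.UTF-8 sort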
Regarding the POSIX character classes, you've already noted you used the C API to find these. For Unicode characters and classes you can use recode's dump-with-names charset to get the desired output. From its manual again:
Using similar syntax to the above, combined with its included test dataset, I can get my own character map with:
OUTPUT
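Not the author's exact invocation, but a minimal use of that charset looks something like this (the input here is just a tab, a space and UTF-8-encoded U+2002 EN SPACE):

# recode reads stdin and prints one line per character in its
# dump-with-names charset: UCS-2 value, RFC 1345 mnemonic and name
printf '\t \342\200\202\n' | recode UTF-8..dump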
But for common characters, recode is apparently not necessary. This should give you named chars for everything in a 128-character charset:
OUTPUT
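One plausible shape for that - a sketch rather than the original command - relies on od's named-character output type, which already knows the ASCII names:

# emit bytes 0-127 and let od print them as named ASCII characters
seq 0 127 |
  while read b; do printf "\\$(printf '%03o' "$b")"; done |
  od -An -ta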
Of course, only 128 bytes are represented, but that's because my locale, UTF-8 charmaps or not, uses the ASCII charset and nothing more, so that's all I get. If I ran it without luit filtering it, though, od would roll it back around and print the same map again up to \0400.
There are two major problems with the above method, though. First there is the system's collation order - for non-ASCII locales the byte values for the charsets are not simply in sequence, which, as I think, is likely the core of the problem you're trying to solve.
Well, GNU tr's man page states that it will expand the [:upper:] and [:lower:] classes in order - but that's not a lot.
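That much is easy to check, at least for the classic case-mapping use (GNU tr shown; both classes expand in order, so each upper-case letter maps to the lower-case letter at the same position):

echo 'LOCALE DEFINITIONS' | tr '[:upper:]' '[:lower:]'
# -> locale definitions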
I imagine some heavy-handed solution could be implemented with sort, but that would be a rather unwieldy tool for a backend programming API.
recode will do this thing correctly, but you didn't seem too in love with the program the other day. Maybe today's edits will cast a more friendly light on it, or maybe not.
GNU also offers the gettext function library, and it seems to be able to address this problem at least for the LC_MESSAGES context:
You might also use native Unicode character categories, which are language independent and forgo the POSIX classes altogether, or perhaps call on the former to provide you enough information to define the latter.
The same website that provided the above information also discusses Tcl's own POSIX-compliant regex implementation, which might be yet another way to achieve your goal.
And last among solutions, I will suggest that you can interrogate the LC_COLLATE file itself for the complete and in-order system character map. This may not seem easily done, but I achieved some success with the following after compiling it with localedef as demonstrated below:
It is, admittedly, currently flawed, but I hope it demonstrates the possibility at least.
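Not the flawed command itself, just the general idea: once a locale has been compiled into a directory with localedef (the step sketched near the end of this answer, using ./mylocale as a placeholder name), the binary LC_COLLATE table can at least be dumped byte-wise:

# peek at the start of the compiled, binary collation table
od -An -c ./mylocale/LC_COLLATE | head -n 20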
AT FIRST BLUSH
It really didn't look like much, but then I started noticing copy commands throughout the list. The above file seems to copy in "en_US", for instance, and another real big one that it seems they all share to some degree is iso_14651_t1_common. It's pretty big:
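For a rough sense of that size (a glob is used here since the exact filename can vary between glibc releases):

# line and byte count of the shared collation-table source on Debian
wc -lc /usr/share/i18n/locales/*14651*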
Here is the intro to /usr/share/i18n/locales/POSIX:
...
You can grep through this of course, but you might just:
Instead. You'd get something like this:
... AND MORE
There is also the luit terminal UTF-8 pty translation device, which I guess acts as a go-between for XTerms without UTF-8 support. It handles a lot of switches - such as logging all converted bytes to a file, or -c as a simple |pipe filter.
I never realized there was so much to this - the locales and character maps and all of that. This is apparently a very big deal, but I guess it all goes on behind the scenes. There are - at least on my system - a couple hundred man 3 related results for locale-related searches.
And also there is:
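Presumably a search of roughly this shape (apropos being one way to run it):

# count, then list, every manual page whose name or description mentions "locale"
apropos locale | wc -l
apropos locale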
That will go on for a very long while.
The Xlib functions handle this all of the time - luit is a part of that package.
The Tcl_uni... functions might prove useful as well.
Just a little <tab> completion and man searches and I've learned quite a lot on this subject.
With localedef you can compile the locales in your I18N directory. The output is funky, and not extraordinarily useful - not like the charmaps at all - but you can get the raw format just as you specify above, like I did:
Then with od you can read it - bytes and strings:
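A sketch of those two steps with example inputs only - en_GB and UTF-8 are arbitrary choices from /usr/share/i18n, and ./mylocale is just a scratch output directory:

# compile the en_GB source definition against the UTF-8 charmap;
# localedef often prints warnings even when the output is perfectly usable
mkdir -p ./mylocale
localedef -f UTF-8 -i en_GB ./mylocale

# then read the compiled, binary categories with od - bytes and characters
od -An -tx1z ./mylocale/LC_CTYPE   | head
od -An -c    ./mylocale/LC_COLLATE | head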
Though it is a long way off from winning a beauty contest, that is usable output. And od is as configurable as you want it to be as well, of course.
I guess I also forgot about these:
I probably forgot about them because I couldn't get them to work. I never use Perl and I don't know how to load a module properly, I guess. But the man pages look pretty nice. In any case, something tells me you'll find calling a Perl module at least a little less difficult than I did. And, again, these were already on my computer - and I never even use Perl. There are also notably a few I18N ones that I wistfully scrolled by, knowing full well I wouldn't get them to work either.