Bash – Generate collating order of a list of individual characters

bash · character encoding · shell · zsh

In some cases, the collating order of each individual character needs to be known (to be used). It is commonly expressed in a regex character class like [b-d], and such a character class matches exactly one character within the given range.

Which individual characters are in the range b-d (or any other range)?

It is also known that the collating order in the C locale follows the byte value of each ASCII character[a] (only characters 33 to 126 are shown):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Could the character range be expanded beyond ASCII?

But:

What is the collating order of individual characters in other locales?

Is there a way to s̲h̲o̲w̲ such collating order (in any locale)?

[a] On systems where ASCII is used (which is most systems); others may use EBCDIC or even something else.

Best Answer

There are several aspects to that. We need to list all the characters in the locale's charset, select the graphical ones (like your 33 to 126 ASCII ones) and sort them.

And there's also the question of whether it makes sense to talk of collation order for characters or if it's ever defined. I'll start with that last point.

The POSIX collation algorithm

If we're talking about the C/POSIX collation algorithm as implemented by strcoll() and used by sort, ls, shell globs, awk/expr's < and > string comparison operators, and more generally most tools that sort text based on the user's locale on POSIX systems, note that it is defined for comparing strings.

While in an en_US.UTF-8 locale on a GNU system the string consisting of the single character é sorts after the string consisting of the single character e, Stéphane sorts before Stephanie. In a cs_CZ.UTF-8 locale, c sorts between b and d, but ch sorts between h and i.
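
Those orderings can be checked with sort; a minimal sketch, assuming those locales are available on a GNU system:

$ printf '%s\n' Stephanie Stéphane | LC_ALL=en_US.UTF-8 sort
Stéphane
Stephanie
$ printf '%s\n' b d c ch h i | LC_ALL=cs_CZ.UTF-8 sort
b
c
d
h
ch
i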

The collation algorithm considers the whole string, not characters individually. So knowing the order of characters when compared individually does not necessarily tell us how strings containing those characters will compare.

That algorithm is designed to compare strings the way many languages do in the real world, as in dictionaries and phone books. It is a bit too simple to cover all the subtleties of sorting in different cultures (see ICU, which implements different algorithms, for more on that), but it is good enough for most cases.

In that algorithm, collating elements are assigned several weights. Collating elements include characters, but also combinations of characters such as the multi-part graphemes of the Korean alphabet, the Czech ch, or, on some systems, é expressed as e followed by a combining acute accent (U+0301).

And whole strings are compared using each weight in turn, from the primary weight to the last weight.

For instance, in that en_US.UTF-8 GNU locale, E, é, e, É all have the same primary weight. Stéphane and Stephanie are decomposed into

<S><t><é><p><h><a><n>   <e>
<S><t><e><p><h><a><n><i><e>

collating elements (here, one per character).

Up to n, the collating elements of the two strings have the same primary weight, but i's primary weight is greater than e's, so Stephanie sorts after Stéphane and the secondary weight doesn't even have to be considered.

Now, for Stephane vs Stéphane, the primary weights compare the same, so the secondary weight has to be considered. If you look at /usr/share/i18n/locales/iso14651_t1_common on a GNU system (used as-is by the en_US.UTF-8 locale), you see:

<BAS> # 15
[...]
<ACA> # 18
[...]
<U0065> <e>;<BAS>;<MIN>;IGNORE # 259 e
<U00E9> <e>;<ACA>;<MIN>;IGNORE # 260 é
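
To see how a given character is weighted, you could search that file for its <Uxxxx> symbolic name; a sketch, assuming the layout shown above:

$ grep '^<U00E9>' /usr/share/i18n/locales/iso14651_t1_common
<U00E9> <e>;<ACA>;<MIN>;IGNORE # 260 é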

For characters in the latin alphabet, the secondary weight is used to compare diacritics. And the base character (BAS) sorts before the one with acute accent (ACA). So Stéphane sorts after Stephane. To compare STÉPHANE against Stéphane, we'd have to go up to the third weight where upper case sorts after lower case in English (contrary to Estonian for instance).
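
Putting it together, in that locale those three spellings should sort like this (a sketch, assuming en_US.UTF-8 on a GNU system):

$ printf '%s\n' STÉPHANE Stéphane Stephane | LC_ALL=en_US.UTF-8 sort
Stephane
Stéphane
STÉPHANE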

And there are non-alphanumeric characters like space or punctuation whose primary weight is IGNORE, so they are not considered in the first comparison pass (de facto sorts between deface and degree; that doesn't mean space sorts between f and g).
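
For instance (again assuming that en_US.UTF-8 GNU locale):

$ printf '%s\n' degree 'de facto' deface | LC_ALL=en_US.UTF-8 sort
deface
de facto
degree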

For $'STE\u0301PHANE' vs Stéphane, some systems like Solaris will treat E\u0301 as a collating element with the same weights, except for the last one, as the É (U+00C9) character, while some others treat \u0301 like punctuation, giving poorer results (such as $'STE\u0301PHANE' sorting before Stephane).

Not a total order

On GNU systems the sort order of U+0301 is not even defined, and there are thousands more characters in that case. I like to take the example of the circled digits (U+2460..U+2473) as those clearly should have a sorting order but don't¹:

$ touch ① ② ③ ④ ⑤
$ ls
④  ③  ⑤  ②  ①
$ ls | sort -u
④

There are also characters that are actually defined as having the exact same weights as others (like Ǝ, Ə, Ɛ that all sort the same here).

For that reason, it is actually not possible to sort arbitrary characters in some locales, unless, like some sort implementations do (and as will be required in the next major version of the POSIX specification), you fall back to a memcmp()-like comparison for characters that sort the same.

Listing all graphical characters in the locale's charset

Different locales may use different character sets.

There are three main categories of charsets: single-byte character sets like ASCII or iso-8859-x, where each byte corresponds to a character (though some byte values may be left undefined); multi-byte character sets (and encodings) like UTF-8, GB18030, BIG5 or EUCJP, where characters are encoded on a varying number of bytes; and stateful ones, where a byte or sequence of bytes may represent different characters depending on whether a state-shifting code has been issued before.

That last category is rarely used in locales these days and is generally unmanageable so we can ignore it for now.

The C locale itself is guaranteed to have a single-byte character set. It doesn't have to be ASCII, though it generally is on systems that are not EBCDIC-based.
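
You can check which charset a locale uses with locale charmap; a sketch from a GNU system, assuming those locales have been generated:

$ LC_ALL=C locale charmap
ANSI_X3.4-1968
$ LC_ALL=en_US.UTF-8 locale charmap
UTF-8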

Note that some scripts, like the Latin one used for English, are written left to right while others are written right to left, so having characters of those different scripts (as supported by some charsets) on the same line is not necessarily a good idea.

The same goes for combining characters, which would end up combining with whatever character happens to precede them, or with each other.

Also note that some charsets like Unicode are still evolving. While the codepoint ranges are now fixed to 0..0xD7FF and 0xE000..0x10FFFF, most of those codepoints are still unassigned; every new version of Unicode assigns new ones, and system vendors try to keep up.

The characters classified as graph are listed by the ISO/IEC 30112 Technical Report (2014), which follows up on ISO/IEC TR 14652 (2002). GNU locales seem to follow that; some others (like FreeBSD/Solaris) don't, but I won't blame them, as that classification doesn't seem to make much sense to me. For instance, it excludes most spacing characters, but not U+00A0 (non-breaking space), U+2007 (figure space) or U+200B (zero width space). It includes characters I would consider control characters, like U+200C..U+200F, U+202D, U+202E...² That latter one, the right-to-left override, is particularly relevant for this Q&A as it reverses the display order of left-to-right characters:

$ printf '%b\n' '\u202E' a b c | sort | paste -sd '\0' -
‮abc

(some browsers will show cba if they support it, others abc).

It also includes most tag characters, as well as the thousands of private-use characters, which are unlikely to be assigned, let alone drawable, on your system.

For single-byte charsets (on GNU systems, those for which locale ctype-mb-cur-max returns 1), listing graphical characters should just be a matter of looping over byte values 1 to 255 (byte 0, NUL in every charset, is not graphical and would cause problems) and matching them against [[:graph:]].
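
A quick way to check whether the current locale's charset is single-byte (a sketch, assuming the el_GR.iso88597 locale used below is available):

$ LC_ALL=el_GR.iso88597 locale ctype-mb-cur-max
1

It returns 1 here, so the byte-by-byte approach applies.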

One could do that with awk for instance:

awk '
  BEGIN{
    for (i = 1; i < 256; i++) {
      c = sprintf("%c", i)
      if (c ~ /[[:graph:]]/) print c
    }
  }' | sort | paste -sd '\0' -

In an el_GR.iso88597 locale, Greek using the iso8859-7 single-byte charset, that would give:

`^¨~<=>¦°_­-,;:!?/.·'ʽʼ"«»()[]{}§©@$£¤¥*\&#%+±ª―΄΅0½12²3³456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZΑαΆάΒβΓγΔδΕεΈέΖζΗηΉήΘθΙιΊίΪϊΐΚκΛλΜμΝνΞξΟοΌόΠπΡρΣσςΤτΥυΎύΫϋΰΦφΧχΨψΩωΏώ 

(with the trailing non-breaking space mis-classified as "graph" by GNU locales).

That approach can't be used for multi-byte characters.

If your iconv supports a UCS-4BE or UTF-32BE charset encoding, you can generate all Unicode code points as 32-bit big-endian numbers and convert them to the locale's charset:

perl -e 'print pack("L>*", $_, 10) for 1..0xd7ff, 0xe000..0x10ffff' |
  iconv -cf UCS-4BE |
  grep '[[:graph:]]' |
  sort

Or if it supports a UTF-8 one:

perl -C -M-warnings=nonchar -le 'print chr$_ for 1..0xd7ff, 0xe000..0x10ffff' |
  iconv -cf UTF-8 |
  grep '[[:graph:]]' |
  sort

(the output is left as one character per line to avoid the problems mentioned above and also to avoid generating too long a line).

That relies on the fact that Unicode (and thus its encodings) is meant to include characters of all other possible charsets, so there will always be a Unicode codepoint for each character in any charset. Modern systems actually define their charsets in relation to Unicode, and their wchar_t usually corresponds to Unicode codepoints.

Now, as discussed above, sort resorts to a memcmp()-based comparison for characters that sort the same with strcoll(). For single-byte charsets, that will be a sort on the codepoint in those charsets; for UTF-8, that will sort on the Unicode codepoint, as UTF-8 has that specific property. For other encodings of Unicode like the Chinese GB18030, or other multibyte charsets, the result may seem more or less random.

In any case, that will mean that for two locales with the same collation order, the output of sort will be different if those locales use different character sets.

For instance, if we get back to our ①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳ circled numbers: Unicode specifies them in that order (codepoints 0x2460 to 0x2473). In GNU locales, their order is not defined (① is neither before nor after ②). sort will put ② after ① in a locale using UTF-8 because the UTF-8 byte order follows the Unicode codepoint order. But in a locale like zh_CN.gb18030 that uses GB18030, another encoding of Unicode from China, the order becomes ⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳①②③④⑤⑥⑦⑧⑨⑩, down to how those characters are encoded at byte level (or at least it would be if it weren't for this bug, which makes them order as ①②③④⑤⑥⑦⑧⑨⑩⑮⑯⑰⑱⑲⑳⑪⑫⑬⑭ instead).
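
The UTF-8 half of that can be observed directly (a sketch, assuming a UTF-8 locale and a sort that does the byte-wise fallback for equal-collating lines):

$ printf '%s\n' ② ① | LC_ALL=en_US.UTF-8 sort
①
②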

If you wanted to order the characters of a string based on their collation order, with zsh:

printf "%s\n" ${(j::)${(s::o)string}}

Note that zsh can have NUL characters in its variables, but strcoll() doesn't work with those. zsh tries to work around that, but it's not perfect.

The result would be non-deterministic if strings contained different characters with same sorting order.


¹ 2019 edit: the order of ① ② ③ ④ ⑤ has since been fixed in newer versions of the GNU libc, but as of 2.30, over 95% of characters still don't have a defined order; you can replace ① ② ③ ④ ⑤ with ? ? ? ? ? for instance. Hopefully, GNU locales will eventually be fixed completely (they will have to if they want to comply with the next revision of the standard) and the problem will then be limited to user-defined locales.

² I believe the rationale is that the non-breaking space, figure space, zero-width space, etc. are not included in the space category on the grounds that they should not be used as delimiters (ISO 30112 defines them as characters to be classified as white-space characters, to find syntactical boundaries), and graph is defined as the non-spacing printable characters (the characters that are in the print category and not in the space category, though the text of ISO 30112 defines it as characters to be classified as printable characters, not including the <space> character). So in effect, that's graphical characters plus printable non-graphical characters that are not used as syntactical boundaries.