How to Sort Lines with >> Double Angle Brackets

sort

It seems that sort behaves strangely on lines such as >>b

$ cat test
a
>>b
b
c
>
>>

$ sort test
>
>>
a
b
>>b
c

I would expect the >>b line to be the third line output by sort but it is the fifth. Why does it do this and is there a way to make sort give my expected output?

I'm using GNU/Linux Ubuntu 16.04.

Best Answer

Sorting algorithms in modern locales are quite complex.

Each character (actually collating element which could consist of a sequence of several characters like the Czech ch) is given a number of collating weights that decide of their sorting order.

When comparing two strings, the first weight of all the characters is used first, and other weights are used later to decide ties if the two strings sorted the same with the first weights.

For instance, in many locales, e, é and E have the same primary weight (they are of the same equivalence class, they all match [=e=]).

So, when comparing for instance echo, été and Enter, in the first pass, e, é and E having the same primary weight, it's the second character that will determine the order (c before n before t).

When comparing été, Été, Ete, after the first pass, they all sort the same, so we use the second pass using the secondary weight. In typical GNU locales, the second weight for latin script characters is used to prioritise accents (no accent comes first, then acute, grave, breve, circumflex...). We'll then need to use the third weight to decide between été and Été and that will be based on case (lowercase first in most locales). There are even characters that end up sorting the same because all their weights are identical.

That is used to sort text in a similar way as dictionaries do, as a human would do.

In a dictionary, you'll find that space and most punctuation characters are also ignored. For instance de facto sorts in between debut and devoid. The first weight of the space character is IGNORE.

On a GNU system, you'll find the core collating rules defined in the /usr/share/i18n/locales/iso14651_t1_common (the path may vary with the distribution). In there, you'll see:

ifdef UPPERCASE_FIRST
<CAP>
else
<MIN>
endif
[...]
ifdef UPPERCASE_FIRST
[...]
<MIN> # 10
[...]
else
[...]
<CAP> # 9
[...]
endif

[...]
order_start <SPECIAL>;forward;backward;forward;forward,position
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
[...]
<U003E> IGNORE;IGNORE;IGNORE;<h> # 140 >
[...]
ifdef DIACRIT_FORWARD
order_start <LATIN>;forward;forward;forward;forward,position
else
order_start <LATIN>;forward;backward;forward;forward,position
endif
[...]
<U0065> <e>;<BAS>;<MIN>;IGNORE # 259 e
<U00E9> <e>;<ACA>;<MIN>;IGNORE # 260 é
[...]
<U0045> <e>;<BAS>;<CAP>;IGNORE # 577 E
<U00C9> <e>;<ACA>;<CAP>;IGNORE # 578 É

Which illustrate what we've just said. Both space and > have their first 3 weights set to IGNORE. It's only for strings sorting the same for the first 3 weights that their relative order will be considered (> before space as that <h> would list before the unspecified collating symbol <U0020>).

In locales that define UPPERCASE_FIRST (like /usr/share/i18n/locales/tr_TR), upper case letters will come first (in the 3rd pass). Same with DIACRIT_FORWARD where some locales like de_DE can decide to reverse the order of diacritics for the 2nd pass.

> and >> sort the same in the 1st, 2nd and 3rd pass. In the 4th, > sorts before >> as the empty string sorts before everything.

>>b sorts after b because they sort the same in the first 3 passes, but in the fourth pass, b is IGNORE, so > is greater. It's less than c in the first pass (> ignored, and b before c)... You get the idea.

Now, if you look at the C locale definition. It's a lot simpler. There's only one weight, and the weight is based on the codepoint value from U+0000 to U+10FFFF. So SPC (U+0020) sorts before > (U+003E), which sorts before b (U+0062) which sorts before c (U+0063). No character is ever ignored.

Note that with the GNU libc at least, that order defined in the C locale definition file is ignored when it comes to comparison functions (strcoll() and co. as used by sort). Regardless of the value of LC_CTYPE, with LC_COLLATE=C, strcoll() is equivalent to strcmp(). As in the comparison is always on the byte value, even if those bytes correspond to characters whose unicode codepoint sort the other way round (like 0xA4 U+20AC EURO SIGN vs A5 U+00A5 YEN SIGN in the ISO-8859-15 charset), so LC_ALL=C sort and LC_COLLATE=C sort (provided LC_ALL is not otherwise set) will have the same effect there.

Related Question