How to Sort Lines with >> Double Angle Brackets

sort

It seems that sort behaves strangely on lines such as >>b

$ cat test
a
>>b
b
c
>
>>

$ sort test
>
>>
a
b
>>b
c

I would expect the >>b line to be the third line output by sort but it is the fifth. Why does it do this and is there a way to make sort give my expected output?

I'm using GNU/Linux Ubuntu 16.04.

Best Answer

Sorting algorithms in modern locales are quite complex.

Each character (actually collating element which could consist of a sequence of several characters like the Czech ch) is given a number of collating weights that decide of their sorting order.

When comparing two strings, the first weight of all the characters is used first, and other weights are used later to decide ties if the two strings sorted the same with the first weights.

For instance, in many locales, e, é and E have the same primary weight (they are of the same equivalence class, they all match [=e=]).

So, when comparing for instance echo, été and Enter, in the first pass, e, é and E having the same primary weight, it's the second character that will determine the order (c before n before t).

When comparing été, Été, Ete, after the first pass, they all sort the same, so we use the second pass using the secondary weight. In typical GNU locales, the second weight for latin script characters is used to prioritise accents (no accent comes first, then acute, grave, breve, circumflex...). We'll then need to use the third weight to decide between été and Été and that will be based on case (lowercase first in most locales). There are even characters that end up sorting the same because all their weights are identical.

That is used to sort text in a similar way as dictionaries do, as a human would do.

In a dictionary, you'll find that space and most punctuation characters are also ignored. For instance de facto sorts in between debut and devoid. The first weight of the space character is IGNORE.

On a GNU system, you'll find the core collating rules defined in the /usr/share/i18n/locales/iso14651_t1_common (the path may vary with the distribution). In there, you'll see:

ifdef UPPERCASE_FIRST
<CAP>
else
<MIN>
endif
[...]
ifdef UPPERCASE_FIRST
[...]
<MIN> # 10
[...]
else
[...]
<CAP> # 9
[...]
endif

[...]
order_start <SPECIAL>;forward;backward;forward;forward,position
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
[...]
<U003E> IGNORE;IGNORE;IGNORE;<h> # 140 >
[...]
ifdef DIACRIT_FORWARD
order_start <LATIN>;forward;forward;forward;forward,position
else
order_start <LATIN>;forward;backward;forward;forward,position
endif
[...]
<U0065> <e>;<BAS>;<MIN>;IGNORE # 259 e
<U00E9> <e>;<ACA>;<MIN>;IGNORE # 260 é
[...]
<U0045> <e>;<BAS>;<CAP>;IGNORE # 577 E
<U00C9> <e>;<ACA>;<CAP>;IGNORE # 578 É

Which illustrate what we've just said. Both space and > have their first 3 weights set to IGNORE. It's only for strings sorting the same for the first 3 weights that their relative order will be considered (> before space as that <h> would list before the unspecified collating symbol <U0020>).

In locales that define UPPERCASE_FIRST (like /usr/share/i18n/locales/tr_TR), upper case letters will come first (in the 3^rd pass). Same with DIACRIT_FORWARD where some locales like de_DE can decide to reverse the order of diacritics for the 2^nd pass.

> and >> sort the same in the 1^st, 2^nd and 3^rd pass. In the 4^th, > sorts before >> as the empty string sorts before everything.

>>b sorts after b because they sort the same in the first 3 passes, but in the fourth pass, b is IGNORE, so > is greater. It's less than c in the first pass (> ignored, and b before c)... You get the idea.

Now, if you look at the C locale definition. It's a lot simpler. There's only one weight, and the weight is based on the codepoint value from U+0000 to U+10FFFF. So SPC (U+0020) sorts before > (U+003E), which sorts before b (U+0062) which sorts before c (U+0063). No character is ever ignored.

Note that with the GNU libc at least, that order defined in the C locale definition file is ignored when it comes to comparison functions (strcoll() and co. as used by sort). Regardless of the value of LC_CTYPE, with LC_COLLATE=C, strcoll() is equivalent to strcmp(). As in the comparison is always on the byte value, even if those bytes correspond to characters whose unicode codepoint sort the other way round (like 0xA4 U+20AC EURO SIGN vs A5 U+00A5 YEN SIGN in the ISO-8859-15 charset), so LC_ALL=C sort and LC_COLLATE=C sort (provided LC_ALL is not otherwise set) will have the same effect there.

Related Solutions

Where has the `uniq` or `sort -u` line gone, with some unicode characters

Short version: collation doesn't really work in command line utilities.

Longer version: the underlying function to compare two strings is strcoll. The description isn't very helpful, but the conceptual method of operation is to convert both strings to a canonical form, and then compare the two canonical forms. The function strxfrm constructs this canonical form.

Let's observe the canonical forms of a few strings (with GNU libc, under Debian squeeze):

$ export LC_ALL=en_US.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' b a A à 〼 〇
b d010801020
a c010801020
A c010801090
à 101010102c6b
〼 101010102c6b102c6b102c6b
〇 101010102c6b102c6b102c6b

As you can see, 〼 and 〇 have the same canonical form. I think that's because these characters are not mentioned in the collation tables of the en_US.UTF-8 locale. They are, however, present in a Japanese locale.

$ export LC_ALL=ja_JP.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' 〼 〇 
〼 303030
〇 3c9b

The source code for the locale data (in Debian squeeze) is in /usr/share/i18n/locales/en_US, which includes /usr/share/i18n/locales/iso14651_t1_common. This file doesn't have an entry for U3007 or U303C, nor are they included in any range that I can find.

I'm not familiar with the rules to build the collation order, but from what I understand, the relevant phrasing is

The symbol UNDEFINED shall be interpreted as including all coded character set values not specified explicitly or via the ellipsis symbol. (…) If no UNDEFINED symbol is specified, and the current coded character set contains characters not specified in this section, the utility shall issue a warning message and place such characters at the end of the character collation order.

It looks like Glibc is instead ignoring characters that aren't specified. I don't know if there's a flaw of my understanding of the POSIX spec, if I missed something in Glibc's locale definition, or if there's a bug in the Glibc locale compiler.

Sort by hex value

A solution in perl:

$ perl -anle '
    push @h, [$F[-1],$_];
    END {
        print for map  { $_->[0] }
                  sort { $a->[1] <=> $b->[1] }
                  map  { [$_->[1],hex($_->[0])] } @h;
    }
' file
4 jjk 7
5 hhf 25
2 ukr 9f
3 ezh ae
1 hdh d12

Explanation

While processing file, we create an array of array @h, each of its element is an array reference [$F[-1],$_], with first element is the hex value to compare, and the second element is the whole line.
In END block, we use Schwartzian transform:
- With each element of @h, create an anonymous array, contains the whole line ( $_->[1] the second element of each array ref in @h ) and the hex value to compare hex($_->[0])]
- Sort above array base on the hex value $a->[1] <=> $b->[1]
- Get the first element of each array ref in sorted array map { $_->[0] } then print the result.

Update

With @Joseph R's suggestion, without using Schwartzian Transform:

$ perl -anle '
    push @h, [hex($F[-1]),$_];
    END {
        print $_->[1] for
            sort { $a->[0] <=> $b->[0] } @h;
    }
' file

Update 2

After reading stefan's comment, I think this can call direct:

$ perl -e '
    print sort {hex((split(/\s+/,$a))[-1]) <=> hex((split(/\s+/,$b))[-1])} <>;
' file
4 jjk 7
5 hhf 25
2 ukr 9f
3 ezh ae
1 hdh d12

Best Answer

Related Solutions

Where has the `uniq` or `sort -u` line gone, with some unicode characters

Sort by hex value

Related Question