Sort – Fix Unexpected Sort Order in en_US.UTF-8 Locale

glibc locale sort

While trying to answer this question about SQL sorting, I noticed a sort order I did not expect:

$ export LC_ALL=en_US.UTF-8
$ echo "T-700A Grouped" > sort.txt
$ echo "T-700 AGrouped" >> sort.txt
$ echo "T-700A Halved" >> sort.txt
$ echo "T-700 Whole" >> sort.txt
$ cat sort.txt | sort
T-700 AGrouped
T-700A Grouped
T-700A Halved
T-700 Whole
$ 

Why is "700 A" sorted above "700A", while "700A" is sorted above "700 W"? I would expect a space to sort before "A" consistently, regardless of the characters that follow it.

I get the order I expect if I use the C locale:

$ export LC_ALL=C
$ echo "T-700A Grouped" > sort.txt
$ echo "T-700 AGrouped" >> sort.txt
$ echo "T-700A Halved" >> sort.txt
$ echo "T-700 Whole" >> sort.txt
$ cat sort.txt | sort
T-700 AGrouped
T-700 Whole
T-700A Grouped
T-700A Halved
$ 

Best Answer

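In a locale such as en_US.UTF-8, strings are not compared byte by byte as they are in the C locale. Sorting is done in multiple passes. Each character has three (or sometimes more) weights assigned to it. Let's say for this example the weights are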

         wt#1 wt#2 wt#3
space = [0000.0020.0002]
A     = [1BC2.0020.0008]

To create the sort key, the nonzero weights of the characters of a string are concatenated, one weight level at a time. That is, if a weight is zero, no corresponding weight is added (as can be seen at the beginning for " A", where the space contributes nothing at the first level). So

       wt#1   -- wt#2 ---   -- wt#3 ---
" A" = 1BC2   0020   0020   0002   0008
       A      sp     A      sp     A

       wt#1   wt#2   wt#3
"A"  = 1BC2   0020   0008
       A      A      A

       wt#1   -- wt#2 ---   -- wt#3 ---
"A " = 1BC2   0020   0020   0008   0002
       A      A      sp     A      sp

If you sort these keys as arrays, you get the order you see:

       1BC2   0020   0008               => "A"
       1BC2   0020   0020   0002   0008 => " A"
       1BC2   0020   0020   0008   0002 => "A "

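Applied to your example: because the space has no first-level weight, the first pass compares the lines as if the spaces were absent, that is, as T-700AGrouped, T-700AGrouped, T-700AHalved and T-700Whole. The space only breaks the tie between the first two lines at a later level, which is why it does not consistently sort before "A".

This is a simplification of what actually happens; see the Unicode Collation Algorithm for more details. The above example weights are actually from the standard table, with some details omitted.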
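
The key construction described in this answer can be sketched in a few lines of Python. This is only an illustration of the simplified model, not how glibc or sort actually implements collation; it knows only the two example weights given earlier, and the names WEIGHTS and sort_key are made up for this sketch.

```python
# Example weights from the answer: (level 1, level 2, level 3).
# Only these two characters are defined; this table is an assumption
# made for illustration, not a real collation table.
WEIGHTS = {
    " ": (0x0000, 0x0020, 0x0002),
    "A": (0x1BC2, 0x0020, 0x0008),
}


def sort_key(s):
    """Concatenate the nonzero weights of s, one weight level at a time."""
    key = []
    for level in range(3):
        for ch in s:
            weight = WEIGHTS[ch][level]
            if weight != 0:  # zero weights contribute nothing at this level
                key.append(weight)
    return tuple(key)


strings = ["A ", " A", "A"]
ordered = sorted(strings, key=sort_key)

for s in ordered:
    print(repr(s), " ".join(f"{w:04X}" for w in sort_key(s)))
# 'A' 1BC2 0020 0008
# ' A' 1BC2 0020 0020 0002 0008
# 'A ' 1BC2 0020 0020 0008 0002
```

Python compares tuples element by element, which matches comparing the concatenated weight arrays from left to right.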
