While trying to answer this question about SQL sorting, I noticed a sort
order I did not expect:
$ export LC_ALL=en_US.UTF-8
$ echo "T-700A Grouped" > sort.txt
$ echo "T-700 AGrouped" >> sort.txt
$ echo "T-700A Halved" >> sort.txt
$ echo "T-700 Whole" >> sort.txt
$ cat sort.txt | sort
T-700 AGrouped
T-700A Grouped
T-700A Halved
T-700 Whole
$
Why is 700 A
sorted above 700A
, while 700A
is above 700 W
? I would expect a space to come before A
consistently, independent of the characters following it.
It works fine if you use the C locale:
$ export LC_ALL=C
$ echo "T-700A Grouped" > sort.txt
$ echo "T-700 AGrouped" >> sort.txt
$ echo "T-700A Halved" >> sort.txt
$ echo "T-700 Whole" >> sort.txt
$ cat sort.txt | sort
T-700 AGrouped
T-700 Whole
T-700A Grouped
T-700A Halved
$
Best Answer
Sorting is done in multiple passes. Each character has three (or sometimes more) weights assigned to it. Let's say for this example the weights are
To create the sort key, the nonzero weights of the characters of a string are concatenated, one weight level at a time. That is, if a weight is zero, no corresponding weight is added (as can be seen at the beginning for
" A"
). SoIf you sort these arrays you get the order you see:
This is a simplification of what actually happens; see the Unicode Collation Algorithm for more details. The above example weights are actually from the standard table, with some details omitted.