Sort Command – Why Does Sort Say ‘ɛ = e’?

locale sort unicode

ɛ ("Latin epsilon") is a letter used in certain African languages, usually to represent the vowel sound in English "bed". In Unicode it's encoded as U+025B, very distinct from everyday e.

However, if I sort the following:

eb
ed
ɛa
ɛc

it seems that sort considers ɛ and e equivalent:

ɛa
eb
ɛc
ed

What's going on here? And is there a way to make ɛ and e distinct for sorting purposes?

Best Answer

No, sort doesn't consider them equivalent; they just have the same primary weight, so to a first approximation they sort the same.

If you look at /usr/share/i18n/locales/iso14651_t1_common (used as the basis for most locales) on a GNU system (here with glibc 2.27), you'll see:

<U0065> <e>;<BAS>;<MIN>;IGNORE # 259 e
<U025B> <e>;<PCL>;<MIN>;IGNORE # 287 ɛ
<U0045> <e>;<BAS>;<CAP>;IGNORE # 577 E

e, ɛ and E have the same primary weight. e and E also have the same secondary weight, so only the third weight differentiates those two.

When comparing strings, sort (or rather the standard libc function strcoll() that it uses to compare strings) starts by comparing the primary weights of all characters, and only goes on to the secondary weights if the strings are equal on the primary weights (and so on with the other weights).

That's how case seems to be ignored in the sorting order in first approximation. Ab sorts between aa and ac, but Ab can sort before or after ab depending on the language rule (some languages have <MIN> before <CAP> like in British English, some <CAP> before <MIN> like in Estonian).

If e had the same sorting order as ɛ, printf '%s\n' e ɛ | sort -u would return only one line. But as <BAS> sorts before <PCL>, e alone sorts before ɛ. eɛe sorts after EEE (at the secondary weight) even though EEE sorts after eee (for which we need to go up to the third weight).
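
The level-by-level comparison described above can be sketched with a toy model. This is only an illustration, not how glibc stores or compares weights: the weight table is made up (it assumes BAS sorts before PCL and MIN before CAP), the function name collate_sort is hypothetical, and each input string is written as space-separated characters so that awk does not need to be multibyte-aware. For each string it builds a key made of all the primary weights, then all the secondary weights, then all the tertiary weights, and sorts those keys bytewise.

```shell
# Toy model of multi-level collation. The weight table is illustrative only
# (assumption: BAS=1 sorts before PCL=2, MIN=1 before CAP=2).
# Input: one string per line, characters separated by spaces.
collate_sort() {
  LC_ALL=C awk '
    BEGIN {
      # character -> "primary secondary tertiary"
      w["a"] = "a 1 1"; w["b"] = "b 1 1"; w["c"] = "c 1 1"; w["d"] = "d 1 1"
      w["e"] = "e 1 1"; w["E"] = "e 1 2"; w["ɛ"] = "e 2 1"
    }
    {
      p = ""; s = ""; t = ""; word = ""
      for (i = 1; i <= NF; i++) {
        split(w[$i], x, " ")
        p = p x[1]; s = s x[2]; t = t x[3]
        word = word $i
      }
      # Key: all primary weights, then all secondary, then all tertiary.
      print p " " s " " t "\t" word
    }' | LC_ALL=C sort | cut -f2
}

# The input from the question.
printf '%s\n' 'e b' 'e d' 'ɛ a' 'ɛ c' | collate_sort

# The three strings compared above.
printf '%s\n' 'E E E' 'e ɛ e' 'e e e' | collate_sort
```

The first call prints ɛa, eb, ɛc, ed (the primary weights alone decide, so e and ɛ look interchangeable), and the second prints eee, EEE, eɛe (the secondary weight of ɛ outranks the tertiary difference between e and E).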

Now, if you run the following (here on my system with glibc 2.27):

sed -n 's/\(.*;[^[:blank:]]*\).*/\1/p' /usr/share/i18n/locales/iso14651_t1_common |
  sort -k2 | uniq -Df1

You'll notice that there are quite a few characters that have been defined with the exact same 4 weights. In particular, our ɛ has the same weights as:

<U01DD> <e>;<PCL>;<MIN>;IGNORE
<U0259> <e>;<PCL>;<MIN>;IGNORE
<U025B> <e>;<PCL>;<MIN>;IGNORE

And sure enough:

$ printf '%s\n' $'\u01DD' $'\u0259' $'\u025B' | sort -u
ǝ
$ expr ɛ = ǝ
1

That can be seen as a bug in the GNU libc locales. On most other systems, locales make sure that all different characters end up with a different sorting order. With GNU locales it gets even worse, as there are thousands of characters that have no sorting order defined and end up sorting the same. That causes all sorts of problems (such as breaking comm or join, or ls and globs returning results in non-deterministic order), hence the recommendation to use LC_ALL=C to work around those issues.
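
As a minimal sketch of the LC_ALL=C workaround: in the C locale sort compares raw bytes, so every distinct character is distinct both for ordering and for sort -u. The function name bytewise_sort is hypothetical, and the resulting order is the byte order of the encoding (UTF-8 is assumed here), not a linguistically meaningful one.

```shell
# Sort by raw byte values: no two different strings compare equal.
bytewise_sort() { LC_ALL=C sort "$@"; }

# The input from the question: e (0x65) now sorts strictly before
# ɛ (0xC9 0x9B in UTF-8).
printf '%s\n' eb ed ɛa ɛc | bytewise_sort

# Three characters that share all four weights in glibc 2.27 stay distinct.
printf '%s\n' ǝ ə ɛ | bytewise_sort -u
```

The first command prints eb, ed, ɛa, ɛc, and the second prints all three characters instead of collapsing them into one line.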

As noted by @ninjalj in the comments, glibc 2.28, released in August 2018, came with some improvements on that front, though AFAICS there are still some characters or collating elements defined with identical sorting order. On Ubuntu 18.10 with glibc 2.28, in an en_GB.UTF-8 locale:

$ expr $'L\ub7' = $'L\u387'
1

(why would U+00B7 be considered equivalent to U+0387 only when combined with L/l?!).

And:

$ perl -lC -e 'for($i=0; $i<0x110000; $i++) {$i = 0xe000 if $i == 0xd800; print chr($i)}' | sort > all-chars-sorted
$ uniq -d all-chars-sorted | wc -l
4
$ uniq -D all-chars-sorted | wc -l
1061355

(That still leaves over 1 million characters (95% of the Unicode range, down from 98% in 2.27) sorting the same as other characters because their sorting order is not defined.)
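
To read the two counts: uniq -d prints one line per group of repeated lines, while uniq -D prints every line that belongs to such a group. A small sketch on made-up input follows; the helper name sample is hypothetical, the C locale is forced so that the result does not depend on the collation issue discussed here, and -D is a GNU extension.

```shell
# Sample input, already sorted: two groups of repeated lines (a a, c c c).
sample() { printf '%s\n' a a b c c c; }

# Number of groups of duplicates: one line is printed per group.
sample | LC_ALL=C uniq -d | wc -l

# Number of lines that belong to some group of duplicates (GNU uniq only).
sample | LC_ALL=C uniq -D | wc -l
```

This prints 2 and then 5. Read the same way, the glibc 2.28 output says that 1,061,355 characters fall into just 4 groups whose members all compare equal to each other.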

Related Solutions

Sort lines by unicode value

What system are you using?

LC_ALL=C sort < your-file.txt

where your-file.txt is the text you posted, in UTF-8 encoding, sorts it as:

[#ゆうかりんちゃんねる]
[10th Avenue Cafe]
[2nd Flush]
[ALTERNATIVE]
[Alstroemeria Records & Cradle]
[Amateras Records]
[Analyze]
[Z.S.G TRAXXX]
[anagram]
[α music]
[Яiselied]
[ぞめ]
[ほねとかわとがはなれるおと]
[アルトノイラント - Altneuland]
[サディスティックブラウニー]
[セブンスヘブンAmmy's]
[チ→ム♂ツナギ]
[一人華飯スペシャル]
[七瀬屋]

That's on my system (sort from GNU coreutils 8.13, Debian EGLIBC 2.13-38). Piping that output to cut -c2 | tr -d \\n | recode ..dump gives:

UCS2   Mne   Description

0023   Nb    number sign
0031   1     digit one
0032   2     digit two
0041   A     latin capital letter a
0041   A     latin capital letter a
0041   A     latin capital letter a
0041   A     latin capital letter a
005A   Z     latin capital letter z
0061   a     latin small letter a
03B1   a*    greek small letter alpha
042F   JA    cyrillic capital letter ya
305E   zo    hiragana letter zo
307B   ho    hiragana letter ho
30A2   A6    katakana letter a
30B5   Sa    katakana letter sa
30BB   Se    katakana letter se
30C1   Ti    katakana letter ti
4E00
4E03

Same on an older system with sort from GNU coreutils 7.4, EGLIBC 2.11.1-0ubuntu7.12

Sort and ls — why aren’t capitalized letters sorted first

Check your LC_COLLATE environment variable. The easiest way is to run the locale command, which prints the current settings. If you want, you can set it to a different value. For example, you can do (assuming bash):

export LC_COLLATE="C"

and that should fix your issue.
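
One caveat worth sketching: LC_ALL, when set, takes precedence over LC_COLLATE, so the export has no effect if LC_ALL is also set in the environment. The function name c_collate_sort below is hypothetical; it clears LC_ALL in a subshell and sets LC_COLLATE for the sort command only.

```shell
# Sort with C collation for this one command, regardless of a pre-set LC_ALL.
# The subshell keeps the unset from leaking into the calling shell.
c_collate_sort() { (unset LC_ALL; LC_COLLATE=C sort "$@"); }

# Capitals come first because A-Z (0x41-0x5A) precede a-z (0x61-0x7A) in ASCII.
printf '%s\n' b A a B | c_collate_sort
```

This prints A, B, a, b. ls follows the same locale settings, so it can be run the same way.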
