Sort lines by Unicode value

sort unicode

I'm trying to sort the lines of a text file by the Unicode values of their characters. As far as I can tell, this means numerals first, then letters, then CJK ideographs. However, sort (with LC_ALL=C) fails horribly at this task. Here is an excerpt from my list:

[#ゆうかりんちゃんねる]
[チ→ム♂ツナギ]
[ぞめ]
...
[サディスティックブラウニー]
[ほねとかわとがはなれるおと]
[10th Avenue Cafe]
[2nd Flush]
...
[Alstroemeria Records & Cradle]
[ALTERNATIVE]
[アルトノイラント - Altneuland]
[Amateras Records]
[セブンスヘブンAmmy's]
[anagram]
[Analyze]
...
[Z.S.G TRAXXX]
[α music]
[Яiselied]
[一人華飯スペシャル]
[七瀬屋]

It seems that sort ignores, at least sometimes, the characters it can't read: with the katakana prefix dropped, Altneuland does indeed fall between ALTERNATIVE and Amateras Records. Someone suggested using msort, but it failed as well (I tried the options -u c, -u d, and -u n in turn).

First, why is it acting so unexpectedly?
Second, how can I fix this?

Addendum: I'm using Raspbian on a Raspberry Pi (B).

Best Answer

What system are you using?

LC_ALL=C sort < your-file.txt

where your-file.txt is the text you posted, encoded in UTF-8, sorts as:

[#ゆうかりんちゃんねる]
[10th Avenue Cafe]
[2nd Flush]
[ALTERNATIVE]
[Alstroemeria Records & Cradle]
[Amateras Records]
[Analyze]
[Z.S.G TRAXXX]
[anagram]
[α music]
[Яiselied]
[ぞめ]
[ほねとかわとがはなれるおと]
[アルトノイラント - Altneuland]
[サディスティックブラウニー]
[セブンスヘブンAmmy's]
[チ→ム♂ツナギ]
[一人華飯スペシャル]
[七瀬屋]

That is on my system (sort from GNU coreutils 8.13, Debian EGLIBC 2.13-38). Piping that output to cut -c2 | tr -d \\n | recode ..dump gives:

UCS2   Mne   Description

0023   Nb    number sign
0031   1     digit one
0032   2     digit two
0041   A     latin capital letter a
0041   A     latin capital letter a
0041   A     latin capital letter a
0041   A     latin capital letter a
005A   Z     latin capital letter z
0061   a     latin small letter a
03B1   a*    greek small letter alpha
042F   JA    cyrillic capital letter ya
305E   zo    hiragana letter zo
307B   ho    hiragana letter ho
30A2   A6    katakana letter a
30B5   Sa    katakana letter sa
30BB   Se    katakana letter se
30C1   Ti    katakana letter ti
4E00
4E03
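If recode is not installed, a similar check can be approximated with iconv and od. This is only a sketch: the function name second_char_codepoints is made up for illustration, and it assumes every line starts with "[" and that iconv and od are available (as they are on a typical GNU system).

```shell
#!/bin/sh
# Hypothetical helper, not part of the original answer: print the code point
# of the character that follows the opening "[" on each input line.
second_char_codepoints() {
  while IFS= read -r line; do
    # UTF-32BE stores every character in exactly 4 bytes, so bytes 5-8
    # hold the second character. od prints them in hex; tr strips spacing.
    hex=$(printf '%s' "$line" | iconv -f UTF-8 -t UTF-32BE |
          od -An -tx1 -j4 -N4 | tr -d ' \n')
    printf 'U+%s\n' "$hex"
  done
}

# Three lines taken from the question.
printf '%s\n' '[anagram]' '[α music]' '[ぞめ]' | second_char_codepoints
```

The values come out zero-padded to eight hex digits; if they increase from line to line, the lines are in code point order with respect to that character.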

The sort order is the same on an older system with sort from GNU coreutils 7.4, EGLIBC 2.11.1-0ubuntu7.12.
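To compare with your machine, it may help to run the same command on a handful of lines first. The sketch below uses five lines from the question. The comment about pipelines describes one possible cause of a difference, which is my assumption and not something the question confirms.

```shell
#!/bin/sh
# Five lines from the question, deliberately out of order.
sample=$(printf '%s\n' '[α music]' '[七瀬屋]' '[anagram]' '[ぞめ]' '[Z.S.G TRAXXX]')

# The assignment prefixes sort itself, so sort compares raw bytes.
# For valid UTF-8 input, byte order and code point order coincide.
sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort)
printf '%s\n' "$sorted"

# By contrast, in `LC_ALL=C cat file | sort` the variable is set only for
# cat; sort would still run under the locale inherited from the shell.
```

If sort is comparing bytes, this prints [Z.S.G TRAXXX], [anagram], [α music], [ぞめ], [七瀬屋], one per line. Running locale with no arguments shows which settings the shell would otherwise pass on to sort.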