I'm trying to sort the lines of a text file by their Unicode code point values. As far as I can tell, this means numerals first, then letters, then CJK ideographs. However, sort (with LC_ALL=C) fails horribly at this task. Here is an excerpt from my list:
[#ゆうかりんちゃんねる]
[チ→ム♂ツナギ]
[ぞめ]
...
[サディスティックブラウニー]
[ほねとかわとがはなれるおと]
[10th Avenue Cafe]
[2nd Flush]
...
[Alstroemeria Records & Cradle]
[ALTERNATIVE]
[アルトノイラント - Altneuland]
[Amateras Records]
[セブンスヘブンAmmy's]
[anagram]
[Analyze]
...
[Z.S.G TRAXXX]
[α music]
[Яiselied]
[一人華飯スペシャル]
[七瀬屋]
It seems like sort ignores (at least sometimes) the characters it can't read, because Altneuland would indeed be between Alternative and Amateras Records. Someone suggested using msort, but it failed as well (with options -u c, -u d, and -u n, respectively).
First, why is it acting so unexpectedly?
Second, how can I fix this?
Addendum: I'm using Raspbian on a Raspberry Pi (B).
Best Answer
What system are you using?
Where your-file.txt is the text you posted in UTF-8 encoding, it sorts as:
On my system (sort from GNU coreutils 8.13, Debian EGLIBC 2.13-38). Which, when piped to cut -c2 | tr -d \\n | recode ..dump, gives:
Same on an older system with sort from GNU coreutils 7.4, EGLIBC 2.11.1-0ubuntu7.12.
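For reference, here is a minimal sketch of a byte-order sort, not necessarily the exact command used above. It assumes the input is valid UTF-8, in which case comparing raw bytes gives the same order as comparing Unicode code points. The function name sort_by_codepoint is made up for illustration. Writing LC_ALL=C as a prefix on the sort command applies the locale to that sort process itself, which is worth double-checking if the results look locale-dependent.

```shell
#!/bin/sh
# Hypothetical helper: sort the lines on stdin by raw byte value.
# For valid UTF-8 input, byte order equals Unicode code point order.
sort_by_codepoint() {
    # The assignment prefix sets LC_ALL only for this one sort process.
    LC_ALL=C sort
}

# A few lines taken from the question, deliberately out of order.
printf '%s\n' \
    '[七瀬屋]' \
    '[anagram]' \
    '[ぞめ]' \
    '[2nd Flush]' \
    '[α music]' \
    '[ALTERNATIVE]' |
    sort_by_codepoint
```

With this input, the digit line should come first, then the uppercase and lowercase Latin lines, then Greek, then hiragana, and the CJK ideograph line last. To sort a file instead, redirect it into the function, for example sort_by_codepoint < your-file.txt.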