Sort lines by unicode value


I'm trying to sort the lines of a text file by their Unicode values (code points). As far as I can tell, this means numerals first, then letters, then CJK ideographs. However, sort (with LC_ALL=C) fails horribly at this task. Here is an excerpt from my list:

[#ゆうかりんちゃんねる]
[チ→ム♂ツナギ]
[ぞめ]
...
[サディスティックブラウニー]
[ほねとかわとがはなれるおと]
[10th Avenue Cafe]
[2nd Flush]
...
[Alstroemeria Records & Cradle]
[ALTERNATIVE]
[アルトノイラント - Altneuland]
[Amateras Records]
[セブンスヘブンAmmy's]
[anagram]
[Analyze]
...
[Z.S.G TRAXXX]
[α music]
[Яiselied]
[一人華飯スペシャル]
[七瀬屋]

It seems like sort ignores (at least sometimes) the characters it can't read: with the katakana skipped, Altneuland would indeed sort between ALTERNATIVE and Amateras Records. Someone suggested using msort, but it failed as well (with options -u c, -u d, and -u n, respectively).

First, why is it behaving so unexpectedly?
Second, how can I fix this?

Addendum: I'm using Raspbian on a Raspberry Pi (B).

Best Answer

What system are you using?

LC_ALL=C sort < your-file.txt

where your-file.txt is the text you posted, encoded in UTF-8, gives:

[#ゆうかりんちゃんねる]
[10th Avenue Cafe]
[2nd Flush]
[ALTERNATIVE]
[Alstroemeria Records & Cradle]
[Amateras Records]
[Analyze]
[Z.S.G TRAXXX]
[anagram]
[α music]
[Яiselied]
[ぞめ]
[ほねとかわとがはなれるおと]
[アルトノイラント - Altneuland]
[サディスティックブラウニー]
[セブンスヘブンAmmy's]
[チ→ム♂ツナギ]
[一人華飯スペシャル]
[七瀬屋]

That is on my system (sort from GNU coreutils 8.13, Debian EGLIBC 2.13-38). Piping the sorted output through cut -c2 | tr -d \\n | recode ..dump, to show the character that follows each opening bracket, gives:

UCS2   Mne   Description

0023   Nb    number sign
0031   1     digit one
0032   2     digit two
0041   A     latin capital letter a
0041   A     latin capital letter a
0041   A     latin capital letter a
0041   A     latin capital letter a
005A   Z     latin capital letter z
0061   a     latin small letter a
03B1   a*    greek small letter alpha
042F   JA    cyrillic capital letter ya
305E   zo    hiragana letter zo
307B   ho    hiragana letter ho
30A2   A6    katakana letter a
30B5   Sa    katakana letter sa
30BB   Se    katakana letter se
30C1   Ti    katakana letter ti
4E00
4E03

The result is the same on an older system with sort from GNU coreutils 7.4 and EGLIBC 2.11.1-0ubuntu7.12.
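For readers without recode, here is a rough Python sketch of the same check: it takes the character after each opening bracket and prints its code point, so you can see that the sorted output really is in ascending code point order. The list is a subset of the sorted lines above, and the helper name second_char_dump is made up for this illustration.

```python
import unicodedata

# A subset of the sorted listing above, in the order sort produced it.
lines = [
    "[#ゆうかりんちゃんねる]",
    "[10th Avenue Cafe]",
    "[ALTERNATIVE]",
    "[anagram]",
    "[α music]",
    "[Яiselied]",
    "[ぞめ]",
    "[アルトノイラント - Altneuland]",
    "[一人華飯スペシャル]",
]

def second_char_dump(rows):
    """Return (code point, Unicode name) for the character after '['.

    Hypothetical helper, roughly what `cut -c2 | recode ..dump` shows.
    """
    return [(ord(row[1]), unicodedata.name(row[1], "?")) for row in rows]

dump = second_char_dump(lines)
for code_point, name in dump:
    print(f"{code_point:04X}  {name}")

# The code points never decrease, i.e. the lines are in code point order.
code_points = [cp for cp, _ in dump]
print(code_points == sorted(code_points))
```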

Related Solutions

Where has the `uniq` or `sort -u` line gone, with some unicode characters

Short version: collation doesn't really work in command line utilities.

Longer version: the underlying function to compare two strings is strcoll. The description isn't very helpful, but the conceptual method of operation is to convert both strings to a canonical form, and then compare the two canonical forms. The function strxfrm constructs this canonical form.
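To make the relationship between the two functions concrete, here is a small Python sketch using the locale module's wrappers around strcoll and strxfrm. It assumes the "C" locale only because that one exists everywhere; in that locale the comparison is by plain character value, so it says nothing about how en_US.UTF-8 behaves.

```python
import locale
from functools import cmp_to_key

# Assumption: use the "C" locale, which is always available.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["b", "a", "A"]

# Sort by comparing pairs of strings with strcoll...
by_strcoll = sorted(words, key=cmp_to_key(locale.strcoll))

# ...or by comparing the transformed keys produced by strxfrm.
by_strxfrm = sorted(words, key=locale.strxfrm)

# Both approaches are meant to give the same order.
print(by_strcoll, by_strxfrm)
```

The point is only that comparing strxfrm results is supposed to agree with calling strcoll directly, which is why inspecting the strxfrm output tells you how two strings will compare.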

Let's observe the canonical forms of a few strings (with GNU libc, under Debian squeeze):

$ export LC_ALL=en_US.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' b a A à 〼 〇
b d010801020
a c010801020
A c010801090
à 101010102c6b
〼 101010102c6b102c6b102c6b
〇 101010102c6b102c6b102c6b

As you can see, 〼 and 〇 have the same canonical form. I think that's because these characters are not mentioned in the collation tables of the en_US.UTF-8 locale. They are, however, present in a Japanese locale.

$ export LC_ALL=ja_JP.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' 〼 〇 
〼 303030
〇 3c9b

The source code for the locale data (in Debian squeeze) is in /usr/share/i18n/locales/en_US, which includes /usr/share/i18n/locales/iso14651_t1_common. This file doesn't have an entry for U+3007 (〇) or U+303C (〼), nor are they included in any range that I can find.

I'm not familiar with the rules for building the collation order, but from what I understand, the relevant wording in the POSIX specification is:

The symbol UNDEFINED shall be interpreted as including all coded character set values not specified explicitly or via the ellipsis symbol. (…) If no UNDEFINED symbol is specified, and the current coded character set contains characters not specified in this section, the utility shall issue a warning message and place such characters at the end of the character collation order.

It looks like Glibc is instead ignoring characters that aren't specified. I don't know whether there's a flaw in my understanding of the POSIX spec, whether I missed something in Glibc's locale definition, or whether there's a bug in the Glibc locale compiler.
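As a toy model of that hypothesis (this is not Glibc's actual algorithm, and toy_key is a made-up function), suppose the collation key simply drops every character the locale does not know about. Assuming here that "known" means ASCII letters and digits, compared case-insensitively, the sketch below reproduces both symptoms: 〼 and 〇 get identical keys, and the Altneuland line from the question lands between ALTERNATIVE and Amateras Records.

```python
def toy_key(line):
    """Toy collation key: unknown characters contribute nothing.

    Assumption for illustration only: 'known' means ASCII letters and
    digits, compared case-insensitively.
    """
    return "".join(ch.lower() for ch in line if ord(ch) < 128 and ch.isalnum())

titles = [
    "[Amateras Records]",
    "[アルトノイラント - Altneuland]",
    "[ALTERNATIVE]",
]
ordered = sorted(titles, key=toy_key)
for title in ordered:
    print(title)

# Two different lines made only of unknown characters share the same
# (empty) key, so a uniq-style comparison would treat them as equal.
same_key = toy_key("〼") == toy_key("〇")
print(same_key)
```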

How do get unix sort to sort in same order as Java (by unicode value)

The LC_COLLATE locale category controls the sorting order. LC_ALL sets all categories.

With LC_COLLATE=C, strings are sorted byte by byte. The bytes don't have to be ASCII characters (only byte values between 0 and 127 are ASCII). On a unix system, Unicode is almost always encoded as UTF-8. UTF-8 has the property that the encoding of characters as byte sequences preserves their ordering, and so sorting UTF-8 strings in byte lexicographic order is equivalent to sorting them in character lexicographic order. Therefore LC_COLLATE=C is suitable for sorting Unicode encoded in UTF-8 lexicographically according to the character values.
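The order-preserving property is easy to check in Python, as a quick sketch: sorting by the UTF-8 bytes gives the same result as sorting by code point. The sample lines are taken from the question; nothing here calls the sort utility itself.

```python
lines = [
    "[七瀬屋]",
    "[anagram]",
    "[α music]",
    "[Z.S.G TRAXXX]",
    "[ぞめ]",
    "[2nd Flush]",
    "[Яiselied]",
]

# Byte-by-byte order of the UTF-8 encoding, which is what LC_COLLATE=C compares.
by_bytes = sorted(lines, key=lambda s: s.encode("utf-8"))

# Order by Unicode code point (Python's default string ordering).
by_code_point = sorted(lines, key=lambda s: [ord(c) for c in s])

print(by_bytes == by_code_point)
for line in by_bytes:
    print(line)
```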

Note that Java does not actually sort according to the Unicode character values but according to their UTF-16 encoding. This makes a difference with surrogate pairs, i.e. if you have code points above 65535.
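Here is a small sketch of where the two orders part ways, using Python to emulate UTF-16 code unit comparison by sorting on the big-endian UTF-16 bytes. The two characters are arbitrary picks for illustration: one near the top of the Basic Multilingual Plane and one above U+FFFF.

```python
bmp = "\uFF5E"         # U+FF5E FULLWIDTH TILDE, a single UTF-16 code unit
astral = "\U0001F600"  # U+1F600, stored in UTF-16 as the surrogate pair D83D DE00

# Code point order: U+FF5E comes before U+1F600.
by_code_point = sorted([astral, bmp])

# UTF-8 byte order agrees with code point order.
by_utf8 = sorted([astral, bmp], key=lambda s: s.encode("utf-8"))

# UTF-16 code unit order: D83D sorts before FF5E, so the order flips.
by_utf16 = sorted([bmp, astral], key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8)
print(by_code_point == by_utf16)
```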

None of UTF-8 byte order, Java's sort order, or the sort utility in a UTF-8 locale on GNU/Linux takes combining characters into account. For example, á written as U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT sorts differently from á written as U+00E1 LATIN SMALL LETTER A WITH ACUTE. (In a UTF-8 locale, both end up equivalent to a in the first pass, but the second pass sorts by code point.)
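A short Python sketch of the combining-character issue under plain code point ordering: the two spellings of á are different strings, another string can sort between them, and normalizing (NFC is used here as one possible choice) makes them identical before sorting.

```python
import unicodedata

decomposed = "a\u0301"  # U+0061 followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"  # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# The two spellings look the same but are different code point sequences.
print(decomposed == precomposed)

# By code point, "b" sorts between the two spellings.
ordered = sorted([precomposed, "b", decomposed])
print([[f"U+{ord(c):04X}" for c in s] for s in ordered])

# After NFC normalization both spellings are the same string.
normalized_equal = (
    unicodedata.normalize("NFC", decomposed)
    == unicodedata.normalize("NFC", precomposed)
)
print(normalized_equal)
```

Whether normalizing first is appropriate depends on the data; it changes the bytes of the lines, so it is a preprocessing decision rather than a sort option.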
