How do I get Unix sort to sort in the same order as Java (by Unicode value)?

java sort unicode

I shell out sorting to the unix sort command in a Java program I've written. However I am having problems arising from Java's string comparison behaving differently than the comparisons done by sort.

From the Java documentation (String.compareTo):

Compares two strings lexicographically. The comparison is based on the
Unicode value of each character in the strings.

From the sort man page:

* WARNING * The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.

So my guess is that I need to sort with LC_ALL=C. However, I always thought this meant sorting based on ASCII value, which means who knows what could happen with Unicode.

Best Answer

The LC_COLLATE locale category controls the sorting order. LC_ALL sets all categories.
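Since the question shells out to sort from Java, the locale can be set for the child process only, leaving the JVM's own environment untouched. The following is a minimal sketch, not a definitive implementation: the class name CLocaleSort, the helper sortCommand, and the use of sort's -o option with file arguments are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

final class CLocaleSort {
    // Builds (but does not start) a "sort" process whose locale is forced to C.
    // LC_ALL takes precedence over LC_COLLATE and LANG, so setting it is enough.
    static ProcessBuilder sortCommand(Path input, Path output) {
        ProcessBuilder pb = new ProcessBuilder(
                "sort", "-o", output.toString(), input.toString());
        pb.environment().put("LC_ALL", "C"); // affects the child process only
        return pb;
    }

    // Hypothetical usage: java CLocaleSort.java in.txt out.txt
    public static void main(String[] args) throws IOException, InterruptedException {
        Process p = sortCommand(Paths.get(args[0]), Paths.get(args[1]))
                .inheritIO()
                .start();
        System.exit(p.waitFor());
    }
}
```

If only the collation should change, putting LC_COLLATE instead of LC_ALL would also work, provided LC_ALL is not already set in the inherited environment.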

With LC_COLLATE=C, strings are sorted byte by byte. The bytes don't have to be ASCII characters (only byte values between 0 and 127 are ASCII). On a unix system, Unicode is almost always encoded as UTF-8. UTF-8 has the property that the encoding of characters as byte sequences preserves their ordering, and so sorting UTF-8 strings in byte lexicographic order is equivalent to sorting them in character lexicographic order. Therefore LC_COLLATE=C is suitable for sorting Unicode encoded in UTF-8 lexicographically according to the character values.
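The order-preserving property of UTF-8 can be checked with a small sketch. The class and method names below are made up for illustration, and the sample code points are arbitrary; the comparison treats bytes as unsigned values, which is what a byte-by-byte comparison in the C locale does. It assumes well-formed strings (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class Utf8ByteOrder {
    // Compares two strings by their UTF-8 bytes, each byte taken as unsigned (0..255).
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) return d;
        }
        return x.length - y.length; // a shorter prefix sorts first
    }

    // Turns each code point into a one-character string, sorts the strings
    // by UTF-8 bytes, and returns the code points in the resulting order.
    static List<Integer> sortByUtf8(List<Integer> codePoints) {
        List<String> strings = new ArrayList<>();
        for (int cp : codePoints) strings.add(new String(Character.toChars(cp)));
        strings.sort(Utf8ByteOrder::compareUtf8);
        List<Integer> result = new ArrayList<>();
        for (String s : strings) result.add(s.codePointAt(0));
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input =
                Arrays.asList(0x4E00, 0x10000, 0x61, 0xFF5E, 0x3B1, 0x5A, 0x42F);
        // The byte-wise order comes out in ascending code point order.
        for (int cp : sortByUtf8(input)) System.out.printf("U+%04X%n", cp);
    }
}
```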

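Note that Java does not actually sort according to the Unicode code point values but according to their UTF-16 encoding: String.compareTo compares 16-bit code units. This makes a difference only with surrogate pairs, i.e. if you have code points above 65535 (U+FFFF). Such a character is encoded with surrogates in the range D800 to DFFF, so Java sorts it before the characters U+E000 to U+FFFF, whereas in code point order (and in UTF-8 byte order) it sorts after them.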
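To make the surrogate pair difference concrete, here is a short sketch. The comparator BY_CODE_POINT is a hypothetical helper, not part of the JDK, and the two sample characters are arbitrary. If the Java side has to agree with LC_ALL=C sort on UTF-8 input, a comparator along these lines could replace String.compareTo.

```java
import java.util.Comparator;

final class CodePointOrder {
    // Orders strings by Unicode code point instead of by UTF-16 code unit.
    static final Comparator<String> BY_CODE_POINT = (a, b) -> {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) return Integer.compare(ca, cb);
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    };

    public static void main(String[] args) {
        String bmp = "\uFF5E";                                          // one code unit: FF5E
        String supplementary = new String(Character.toChars(0x10000)); // surrogate pair: D800 DC00

        // UTF-16 code unit order: 0xFF5E > 0xD800, so bmp sorts after supplementary.
        System.out.println(bmp.compareTo(supplementary) > 0);              // true
        // Code point order: U+FF5E < U+10000, so bmp sorts before supplementary.
        System.out.println(BY_CODE_POINT.compare(bmp, supplementary) < 0); // true
    }
}
```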

Neither UTF-8 byte representation sorting nor Java sorting nor the sort utility in a UTF-8 locale on GNU/Linux take combining characters into account, e.g. á (U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT) is sorted differently from á (U+00E1 LATIN SMALL LETTER A WITH ACUTE) (in a UTF-8 locale, both end up equivalent to a in the first pass but the second pass sorts by code point).
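The combining character case can be reproduced in Java as follows. This is only an illustration: normalizing to NFC with java.text.Normalizer is one possible way to make the two spellings compare equal, it is an assumption of this sketch rather than something the sort utility does, and the input file would need the same normalization for both sides to agree.

```java
import java.text.Normalizer;

final class CombiningDemo {
    // Normalizes to NFC, which composes "a" + U+0301 into U+00E1.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301"; // LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
        String precomposed = "\u00E1"; // LATIN SMALL LETTER A WITH ACUTE

        // Plain comparison sees different code units: 0x0061 versus 0x00E1.
        System.out.println(decomposed.compareTo(precomposed) < 0); // true
        // After normalization the two spellings are the same string.
        System.out.println(nfc(decomposed).equals(precomposed));   // true
    }
}
```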

Related Solutions

Sort lines by unicode value

What system are you using?

LC_ALL=C sort < your-file.txt

Where your-file.txt is the text you posted in utf-8 encoding, sorts as:

[#ゆうかりんちゃんねる]
[10th Avenue Cafe]
[2nd Flush]
[ALTERNATIVE]
[Alstroemeria Records & Cradle]
[Amateras Records]
[Analyze]
[Z.S.G TRAXXX]
[anagram]
[α music]
[Яiselied]
[ぞめ]
[ほねとかわとがはなれるおと]
[アルトノイラント - Altneuland]
[サディスティックブラウニー]
[セブンスヘブンAmmy's]
[チ→ム♂ツナギ]
[一人華飯スペシャル]
[七瀬屋]

That is on my system (sort from GNU coreutils 8.13, Debian EGLIBC 2.13-38). Piping that output to cut -c2 | tr -d \\n | recode ..dump gives:

UCS2   Mne   Description

0023   Nb    number sign
0031   1     digit one
0032   2     digit two
0041   A     latin capital letter a
0041   A     latin capital letter a
0041   A     latin capital letter a
0041   A     latin capital letter a
005A   Z     latin capital letter z
0061   a     latin small letter a
03B1   a*    greek small letter alpha
042F   JA    cyrillic capital letter ya
305E   zo    hiragana letter zo
307B   ho    hiragana letter ho
30A2   A6    katakana letter a
30B5   Sa    katakana letter sa
30BB   Se    katakana letter se
30C1   Ti    katakana letter ti
4E00
4E03

The result is the same on an older system with sort from GNU coreutils 7.4, EGLIBC 2.11.1-0ubuntu7.12.

Sort order explanation

I would recommend using this instead:

sort -V data.txt

-V stands for "version sort": it compares the text parts alphabetically and the runs of digits within them numerically, so if you had more files, say:

f1.txt
f10.txt
f2.txt
a1.txt
a10.txt
a2.txt

then sort -V will give you

a1.txt
a2.txt
a10.txt
f1.txt
f2.txt
f10.txt

whereas sort -k 1.2n or sort -n -k 1.2 will give you:

a1.txt
f1.txt
a2.txt
f2.txt
a10.txt
f10.txt
