In a Java program I've written, I shell out to the Unix `sort` command to do the sorting. However, I'm running into problems because Java's string comparison behaves differently from the comparison done by `sort`.
From the [Java Doc][1]:
> Compares two strings lexicographically. The comparison is based on the
> Unicode value of each character in the strings.
From the sort man page:
> * WARNING * The locale specified by the environment affects sort
> order. Set LC_ALL=C to get the traditional sort order that uses native
> byte values.
So my guess is that I need to run sort with LC_ALL=C. However, I always thought this meant sorting by ASCII value, so I don't know what would happen with Unicode input.
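
For concreteness, here is a minimal sketch of running `sort` with `LC_ALL=C` from Java. The class name `ShellSort`, the helper `sortCommand`, and the file names are made up for illustration; the sketch also assumes a `sort` binary on the `PATH` that accepts the POSIX `-o` option.

```java
import java.io.File;

final class ShellSort {
    /**
     * Builds (but does not start) a sort invocation whose environment
     * contains LC_ALL=C. Hypothetical helper, not part of any library.
     */
    static ProcessBuilder sortCommand(File input, File output) {
        ProcessBuilder pb =
                new ProcessBuilder("sort", "-o", output.getPath(), input.getPath());
        // Only the child process sees this variable; the JVM's locale is unchanged.
        pb.environment().put("LC_ALL", "C");
        return pb;
    }

    public static void main(String[] args) throws Exception {
        // File names are placeholders for illustration.
        Process p = sortCommand(new File("unsorted.txt"), new File("sorted.txt"))
                .inheritIO()
                .start();
        System.exit(p.waitFor());
    }
}
```

Setting the variable on the `ProcessBuilder` rather than in the surrounding shell keeps the change local to that one child process.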
Best Answer
The `LC_COLLATE` locale category controls the sorting order. `LC_ALL` sets all categories.
With `LC_COLLATE=C`, strings are sorted byte by byte. The bytes don't have to be ASCII characters (only byte values between 0 and 127 are ASCII). On a Unix system, Unicode is almost always encoded as UTF-8. UTF-8 has the property that the encoding of characters as byte sequences preserves their ordering, and so sorting UTF-8 strings in byte lexicographic order is equivalent to sorting them in character lexicographic order. Therefore `LC_COLLATE=C` is suitable for sorting Unicode encoded in UTF-8 lexicographically according to the character values.
Note that Java does not actually sort according to the Unicode character values but according to their UTF-16 encoding. This makes a difference with surrogate pairs, i.e. if you have code points above 65535.
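
To illustrate the surrogate-pair difference, here is a minimal sketch comparing the same two strings three ways: with `String.compareTo` (UTF-16 code units), by code point, and by unsigned UTF-8 bytes, which is the ordering a byte-wise sort would use. The class and comparator names are made up for this example, `Arrays.compareUnsigned` assumes Java 9 or later, and the strings are assumed to be well-formed (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

final class CodePointOrder {
    /** Orders strings by Unicode code point instead of UTF-16 code unit. */
    static final Comparator<String> BY_CODE_POINT = (a, b) -> {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(i);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            // Equal code points occupy the same number of chars in both strings.
            i += Character.charCount(ca);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    };

    /** Orders strings by their UTF-8 encoding, compared as unsigned bytes. */
    static final Comparator<String> BY_UTF8_BYTES = (a, b) ->
            Arrays.compareUnsigned(
                    a.getBytes(StandardCharsets.UTF_8),
                    b.getBytes(StandardCharsets.UTF_8));

    public static void main(String[] args) {
        String ligature = "\uFB01";      // U+FB01, one UTF-16 code unit
        String emoji = "\uD83D\uDE00";   // U+1F600, a surrogate pair

        // UTF-16 order: 0xFB01 is greater than the high surrogate 0xD83D.
        System.out.println(Integer.signum(ligature.compareTo(emoji)));                // 1
        // Code point order: U+FB01 is less than U+1F600.
        System.out.println(Integer.signum(BY_CODE_POINT.compare(ligature, emoji)));   // -1
        // UTF-8 byte order agrees with code point order (EF.. vs F0..).
        System.out.println(Integer.signum(BY_UTF8_BYTES.compare(ligature, emoji)));   // -1
    }
}
```

If the Java side has to agree with output sorted under `LC_COLLATE=C`, sorting with a comparator along the lines of `BY_CODE_POINT` is one option; for strings without characters above U+FFFF, the plain `compareTo` order already coincides with it.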
Neither UTF-8 byte-order sorting, nor Java sorting, nor the `sort` utility in a UTF-8 locale on GNU/Linux takes combining characters into account. For example, á (U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT) is sorted differently from á (U+00E1 LATIN SMALL LETTER A WITH ACUTE). (In a UTF-8 locale, both end up equivalent to `a` in the first pass, but the second pass sorts by code point.)
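
The combining-character point can be seen directly in Java with a small sketch. The class name is made up, and the use of `java.text.Normalizer` is an assumption on my part: it is shown only as one possible way to make the two forms compare equal, not something the tools above do for you.

```java
import java.text.Normalizer;

final class CombiningDemo {
    /** Converts a string to Unicode Normalization Form C (composed form). */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301";  // U+0061 followed by U+0301
        String precomposed = "\u00E1";  // U+00E1

        // The two look identical when rendered but are different char sequences.
        System.out.println(decomposed.equals(precomposed));              // false
        // 'a' (0x61) is less than U+00E1, so the decomposed form sorts first.
        System.out.println(decomposed.compareTo(precomposed) < 0);       // true
        // After NFC normalization the decomposed form becomes U+00E1.
        System.out.println(nfc(decomposed).equals(precomposed));         // true
    }
}
```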