How do I get Unix sort to sort in the same order as Java (by Unicode value)?

java sort unicode

I shell out to the Unix sort command from a Java program I've written. However, I am running into problems because Java's string comparison behaves differently from the comparisons done by sort.

From the [Java Doc][1]:

Compares two strings lexicographically. The comparison is based on the
Unicode value of each character in the strings.

From the sort man page:

* WARNING * The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.

So my guess is that I need to sort with LC_ALL=C. However, I always thought this meant sorting by ASCII value, which means who knows what could happen with Unicode.

Best Answer

The LC_COLLATE locale category controls the sorting order. LC_ALL sets all categories.
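Since the question shells out from Java, the locale can be set for the child process only, leaving the JVM's own environment untouched. The following is a minimal sketch under assumptions: the class name `CLocaleSort`, the method `sortCommand`, and the input/output file arguments are hypothetical, and it relies on the POSIX `-o` option of sort to write the result to a file.

```java
import java.io.File;
import java.util.Map;

final class CLocaleSort {
    // Builds (but does not start) a sort invocation that collates in the C locale.
    // The input and output files are hypothetical placeholders.
    static ProcessBuilder sortCommand(File input, File output) {
        ProcessBuilder pb = new ProcessBuilder(
                "sort", "-o", output.getPath(), input.getPath());
        Map<String, String> env = pb.environment();
        // LC_ALL takes precedence over LC_COLLATE and LANG for the child process.
        env.put("LC_ALL", "C");
        // Let any error messages from sort show up on the JVM's stderr.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        return pb;
    }
}
```

A caller would then run something like `int exit = CLocaleSort.sortCommand(in, out).start().waitFor();` and check that the exit status is 0.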

With LC_COLLATE=C, strings are sorted byte by byte. The bytes don't have to be ASCII characters (only byte values between 0 and 127 are ASCII). On a unix system, Unicode is almost always encoded as UTF-8. UTF-8 has the property that the encoding of characters as byte sequences preserves their ordering, and so sorting UTF-8 strings in byte lexicographic order is equivalent to sorting them in character lexicographic order. Therefore LC_COLLATE=C is suitable for sorting Unicode encoded in UTF-8 lexicographically according to the character values.
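To reproduce the byte-by-byte order inside Java, one option is to compare the UTF-8 encodings directly. This is only a sketch: the class and method names are made up for illustration, and it assumes Java 9 or later for `Arrays.compareUnsigned`. The comparison has to be unsigned because Java's `byte` type is signed, whereas the C locale treats bytes as values from 0 to 255.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8ByteOrder {
    // Compares two strings by the unsigned byte values of their UTF-8 encodings,
    // which is the order a byte-wise comparison of UTF-8 text produces.
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        return Arrays.compareUnsigned(x, y); // requires Java 9+
    }
}
```

Passing `Utf8ByteOrder::compareUtf8Bytes` as a `Comparator<String>` should then order well-formed strings by code point.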

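Note that Java does not actually sort according to the Unicode character values but according to their UTF-16 encoding. This makes a difference with surrogate pairs, i.e. if you have code points above 65535 (U+FFFF): their surrogate code units (U+D800 to U+DFFF) compare lower than characters in the range U+E000 to U+FFFF, even though the code points themselves are higher.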
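The difference can be seen with one character from the upper end of the Basic Multilingual Plane and one character outside it. The sketch below uses hypothetical names (`CodePointOrder`, `compareByCodePoint`) and shows a comparison by code point that could be used in place of `String.compareTo` when the Java side has to agree with byte-order sorting of UTF-8.

```java
final class CodePointOrder {
    // Compares two strings code point by code point instead of
    // UTF-16 code unit by code unit (which is what String.compareTo does).
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";           // U+FF5E, a single UTF-16 code unit
        String astral = "\uD83D\uDE00";  // U+1F600, encoded as a surrogate pair
        // compareTo looks at 0xFF5E versus the high surrogate 0xD83D.
        System.out.println(bmp.compareTo(astral) > 0);
        // By code point, 0xFF5E is lower than 0x1F600.
        System.out.println(compareByCodePoint(bmp, astral) < 0);
    }
}
```

Both lines print `true`: `String.compareTo` puts U+1F600 first, while the code point comparison (and UTF-8 byte order) puts U+FF5E first.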

Neither UTF-8 byte-order sorting, nor Java sorting, nor the sort utility in a UTF-8 locale on GNU/Linux takes combining characters into account. For example, á written as two code points (U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT) is sorted differently from the precomposed á (U+00E1 LATIN SMALL LETTER A WITH ACUTE). In a UTF-8 locale, both end up equivalent to a in the first pass, but the second pass sorts by code point.
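The Java side of this can be checked with the two spellings of á mentioned above. The snippet is an illustrative sketch with made-up names. The normalization step is not something the answer prescribes; it is shown only as one possible way to make the two forms compare equal before sorting, if that is what you want.

```java
import java.text.Normalizer;

final class CombiningDemo {
    static final String DECOMPOSED = "a\u0301"; // U+0061 followed by U+0301
    static final String PRECOMPOSED = "\u00E1"; // U+00E1

    // Converts a string to Unicode Normalization Form C (canonical composition).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Plain comparison sees 'a' (0x61) versus 0xE1, so the strings differ.
        System.out.println(DECOMPOSED.compareTo(PRECOMPOSED) < 0);
        // After normalization both spellings are the same string.
        System.out.println(nfc(DECOMPOSED).equals(nfc(PRECOMPOSED)));
    }
}
```

Both lines print `true`. If the data is normalized like this, the same normalized text would have to be written to the file handed to sort for the two sides to stay consistent.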
