Sort LC_ALL=C vs LC_ALL=C.utf8 – Differences

locale, sort, unicode

Does the Linux sort command differentiate between the C and C.utf-8 locales?

The sort manual says to use LC_ALL=C to sort by byte value, but I saw that C.utf-8 also allows UTF-8 values (not just ASCII) – yet the sort manual doesn't mention this locale at all.

I don't see any difference between the two when running LC_ALL=C sort file.txt and LC_ALL=C.utf8 sort file.txt, whether or not the file contains UTF-8 characters; both seem to work.

So is there any known difference?

Best Answer

LC_ALL=C sort sorts by byte value. It will sort any input written in any charset by byte value, not only ASCII¹.
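
For instance, here is a small sketch (the bytes 0xE8/0xE9 would be è/é in ISO-8859-1, but sort neither knows nor cares; it only compares byte values):

$ printf '\351\n\350\n' | LC_ALL=C sort | od -An -tx1
 e8 0a e9 0a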

The UTF-8 encoding has the property that sorting by byte value is the same as sorting by Unicode code point (memcmp() will find the encoding of U+1234 greater than that of U+1233 or of any code point below 0x1234).
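
You can see that on the encodings themselves. A quick sketch, assuming a UTF-8 locale and a printf builtin that supports \u escapes (bash ≥ 4.2 or zsh):

$ printf '\u1233' | od -An -tx1
 e1 88 b3
$ printf '\u1234' | od -An -tx1
 e1 88 b4

Comparing those two byte sequences with memcmp() gives the same order as comparing the code points 0x1233 and 0x1234.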

C.utf-8, C.utf8 or C.UTF-8 (the latter being more common in my experience) are not locales standardized by POSIX, but wherever they're found, they're meant to be locales that have most of the properties of the C locale except that the charset is UTF-8.
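
To check which of those spellings, if any, your system ships, you can list the available locales (a sketch; glibc tools tend to report it as C.utf8, other systems may show C.UTF-8):

$ locale -a | grep -i '^c\.'
C.utf8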

LC_ALL=C.UTF-8 sort would also sort the input by code point, but it could end up decoding the UTF-8 before comparison or invoking the heavy strcoll()/strxfrm() machinery, which is wasted effort given that, for UTF-8, memcmp() is enough.

With GNU sort and GNU libc as found on many non-embedded OSes that use Linux as their kernel (here also adding NUL characters in the input which GNU sort supports even though strcoll() doesn't):

$ printf 'a\0£1\na\0€2\n' | LC_ALL=C ltrace -e strcoll -e memcmp sort
sort->memcmp("a\0\302\2431", "a\0\342\202\254", 5)                      = -1
a£1
a€2
$ printf 'a\0£1\na\0€2\n' | LC_ALL=C.UTF-8 ltrace -e strcoll -e memcmp sort
sort->strcoll("a", "a")                                                 = 0
sort->strcoll("\302\2431", "\342\202\2542")                             = -31
a£1
a€2

(Actually, you'll find that when the two strings to compare have the same number of bytes, GNU sort calls memcmp() before strcoll() in case they are identical, as memcmp() is so much cheaper than strcoll().)

Some timings on that output repeated 1,000,000 times:

$ printf 'a\0£1\na\0€2\n%.0s' {1..1000000} > file.test
$ wc -mc file.test
10000000 13000000 file.test
$ time LC_ALL=C sort file.test > /dev/null
LC_ALL=C sort file.test > /dev/null  0.74s user 0.06s system 390% cpu 0.205 total
$ time LC_ALL=C.UTF-8 sort file.test > /dev/null
LC_ALL=C.UTF-8 sort file.test > /dev/null  6.04s user 0.12s system 522% cpu 1.179 total

So to sort UTF-8 encoded text by code point, using C or C.UTF-8 makes no difference functionally, but using C may be more efficient depending on the sort implementation.

Now, not all sequences of bytes form valid UTF-8, so for non-UTF-8 input (that is, input containing byte sequences that can't be decoded as UTF-8), you may find the behaviour differs between C and C.UTF-8. Still on a GNU system:

$ print -l 'a\200b' 'a\201b' | LC_ALL=C sort -u
a�b
a�b
$ print -l 'a\200b' 'a\201b' | LC_ALL=C.UTF-8 sort -u
a�b

(where � is my terminal emulator's rendition of unknown things)

In C.UTF-8, strcoll() returns 0 on those two strings that don't form valid UTF-8 text, in effect reporting that they have the same sorting order.

In the C locale, any line made of bytes other than NUL and not longer than LINE_MAX bytes is valid text. In C.UTF-8, there are further restrictions: a\200b is not valid UTF-8, so it's not text, and as per POSIX the behaviour of sort on it is unspecified.
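
If you want to know beforehand whether your input is valid UTF-8 (and so whether a UTF-8 locale's collation is even applicable to it), one way is to round-trip it through iconv, which fails on undecodable sequences. A sketch, with file standing in for your input:

$ iconv -f UTF-8 -t UTF-8 < file > /dev/null && echo 'valid UTF-8' || echo 'not valid UTF-8'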

As a side note: on GNU systems, while LC_ALL=C takes precedence over $LANGUAGE for the language of the messages, LC_ALL=C.UTF-8 doesn't.

$ LC_ALL=C LANGUAGE=fr:es:en sort /
sort: read failed: /: Is a directory
$ LC_ALL=C.UTF-8 LANGUAGE=fr:es:en sort /
sort: échec de lecture: /: est un dossier

¹ Also note that the C locale charset doesn't have to be based on ASCII, and that ASCII only covers values 0 to 127. C locales that use ASCII still consider bytes 128 to 255 as characters, albeit undefined ones. The C locale charset has to guarantee one byte per character though, so the C locale charset cannot be UTF-8.
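
You can see which charset a given locale uses with locale charmap. On a GNU system, something like:

$ LC_ALL=C locale charmap
ANSI_X3.4-1968
$ LC_ALL=C.UTF-8 locale charmap
UTF-8

(ANSI_X3.4-1968 being glibc's name for ASCII.)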