Sort LC_ALL=C vs LC_ALL=C.utf8 – Differences

locale, sort, unicode

Does the Linux sort command differentiate between the C and C.utf-8 locales?

The sort manual says to use LC_ALL=C to sort by byte value, but I saw that C.utf-8 also allows UTF-8 values (not just ASCII) – yet the sort manual doesn't mention this locale at all.

I don't see any difference between the two when running LC_ALL=C sort file.txt and LC_ALL=C.utf8 sort file.txt, whether or not the file contains UTF-8 characters; both seem to work.

So is there any known difference?

Best Answer

LC_ALL=C sort sorts by byte value. It will sort any input written in any charset by byte value, not only ASCII¹.
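
For instance, here is a small sketch (the bytes 0xE8/0xE9 would be è/é in ISO-8859-1, but sort neither knows nor cares; it only compares byte values):

$ printf '\351\n\350\n' | LC_ALL=C sort | od -An -tx1
 e8 0a e9 0a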

The UTF-8 encoding has the property that sorting by byte value is the same as sorting by Unicode code point (memcmp() will find the encoding of U+1234 greater than that of U+1233 or of any code point below 0x1234).
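
You can see that on the encodings themselves. A quick sketch, assuming a UTF-8 locale and a printf builtin that supports \u escapes (bash ≥ 4.2 or zsh):

$ printf '\u1233' | od -An -tx1
 e1 88 b3
$ printf '\u1234' | od -An -tx1
 e1 88 b4

Comparing those two byte sequences with memcmp() gives the same order as comparing the code points 0x1233 and 0x1234.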

C.utf-8, C.utf8 or C.UTF-8 (the latter being more common in my experience) are not locales standardized by POSIX, but wherever they're found, they're meant to be locales that have most of the properties of the C locale except that the charset is UTF-8.
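
To check which of those spellings, if any, your system ships, you can list the available locales (a sketch; glibc tools tend to report it as C.utf8, other systems may show C.UTF-8):

$ locale -a | grep -i '^c\.'
C.utf8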

LC_ALL=C.UTF-8 sort would also sort the input by code point, but it could end up decoding the UTF-8 before comparison or invoking the heavy strcoll()/strxfrm() machinery, which is wasted effort given that, for UTF-8, memcmp() is enough.

With GNU sort and GNU libc as found on many non-embedded OSes that use Linux as their kernel (here also adding NUL characters in the input which GNU sort supports even though strcoll() doesn't):

$ printf 'a\0£1\na\0€2\n' | LC_ALL=C ltrace -e strcoll -e memcmp sort
sort->memcmp("a\0\302\2431", "a\0\342\202\254", 5)                      = -1
a£1
a€2
$ printf 'a\0£1\na\0€2\n' | LC_ALL=C.UTF-8 ltrace -e strcoll -e memcmp sort
sort->strcoll("a", "a")                                                 = 0
sort->strcoll("\302\2431", "\342\202\2542")                             = -31
a£1
a€2

(Actually, you'll find that when the two strings to compare have the same number of bytes, GNU sort calls memcmp() before strcoll() in case they are identical, as memcmp() is so much cheaper than strcoll().)

Some timings on that output repeated 1,000,000 times:

$ printf 'a\0£1\na\0€2\n%.0s' {1..1000000} > file.test
$ wc -mc file.test
10000000 13000000 file.test
$ time LC_ALL=C sort file.test > /dev/null
LC_ALL=C sort file.test > /dev/null  0.74s user 0.06s system 390% cpu 0.205 total
$ time LC_ALL=C.UTF-8 sort file.test > /dev/null
LC_ALL=C.UTF-8 sort file.test > /dev/null  6.04s user 0.12s system 522% cpu 1.179 total

So to sort UTF-8 encoded text by code point, using C or C.UTF-8 makes no difference functionally, but using C may be more efficient depending on the sort implementation.

Now, not all sequences of bytes form valid UTF-8, so for non-UTF-8 input (that is, input containing byte sequences that can't be decoded as UTF-8), you may find the behaviour differs between C and C.UTF-8. Still on a GNU system:

$ print -l 'a\200b' 'a\201b' | LC_ALL=C sort -u
a�b
a�b
$ print -l 'a\200b' 'a\201b' | LC_ALL=C.UTF-8 sort -u
a�b

(where � is my terminal emulator's rendition of unknown things)

In C.UTF-8, strcoll() returns 0 on those two strings that don't form valid UTF-8 text, in effect reporting that they have the same sorting order.

In the C locale, any line made of bytes other than NUL and not longer than LINE_MAX bytes is valid text. In C.UTF-8, there are further restrictions: a\200b is not valid UTF-8, so it's not text, and as per POSIX the behaviour of sort on it is unspecified.
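
If you want to know beforehand whether your input is valid UTF-8 (and so whether a UTF-8 locale's collation is even applicable to it), one way is to round-trip it through iconv, which fails on undecodable sequences. A sketch, with file standing in for your input:

$ iconv -f UTF-8 -t UTF-8 < file > /dev/null && echo 'valid UTF-8' || echo 'not valid UTF-8'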

As a side note: on GNU systems, while LC_ALL=C takes precedence over $LANGUAGE for the language of the messages, LC_ALL=C.UTF-8 doesn't.

$ LC_ALL=C LANGUAGE=fr:es:en sort /
sort: read failed: /: Is a directory
$ LC_ALL=C.UTF-8 LANGUAGE=fr:es:en sort /
sort: échec de lecture: /: est un dossier

¹ Also note that the C locale charset doesn't have to be based on ASCII, and that ASCII only covers values 0 to 127. C locales that use ASCII still consider bytes 128 to 255 as characters, albeit undefined ones. The C locale charset has to guarantee one byte per character though, so the C locale charset cannot be UTF-8.
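
You can see which charset a given locale uses with locale charmap. On a GNU system, something like:

$ LC_ALL=C locale charmap
ANSI_X3.4-1968
$ LC_ALL=C.UTF-8 locale charmap
UTF-8

(ANSI_X3.4-1968 being glibc's name for ASCII.)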