Does the Linux sort command differentiate between the C and C.utf-8 locales?
The sort manual says to use `LC_ALL=C` to sort by byte value, but I saw that C.utf-8 also allows UTF-8 values (and not just ASCII), yet the sort manual doesn't mention this locale at all.
I don't see any difference between the two when running `LC_ALL=C sort file.txt` and `LC_ALL=C.utf8 sort file.txt`, whether or not the file has UTF-8 characters; both seem to work.
So is there any known difference?
Best Answer
`LC_ALL=C sort` sorts by byte value. It will sort any input written in any charset by byte value, not only ASCII¹.

The UTF-8 encoding has that property that sorting by byte value is the same as sorting by Unicode code point (`memcmp()` will find the encoding of U+1234 is greater than that of U+1233 or any Unicode code point less than 0x1234).

`C.utf-8`, `C.utf8` or `C.UTF-8` (the latter being more common in my experience) are not locales standardized by POSIX, but wherever they're found, they're meant to be locales that have most of the properties of the C locale except that the charset is UTF-8.

`LC_ALL=C.UTF-8 sort` would sort the input based on code point, but could end up decoding the UTF-8 before comparison or invoke the `strcoll()`/`strxfrm()` heavy machinery which would end up being wasted effort given that for UTF-8, using `memcmp()` is enough for that.

With GNU `sort` and GNU `libc` as found on many non-embedded OSes that use Linux as their kernel (here also adding NUL characters in the input which GNU `sort` supports even though `strcoll()` doesn't):

(actually, you'll find that if the two strings to compare have the same number of bytes, GNU `sort` calls `memcmp()` first before calling `strcoll()` in case they are identical, as `memcmp()` is so cheap compared to `strcoll()`).

Some timings on that output repeated 1,000,000 times:
So to sort UTF-8 encoded text by code point, using `C` or `C.UTF-8` will make no difference functionally, but using `C` may be more efficient depending on the `sort` implementation.

Now, not all sequences of bytes form valid UTF-8, so when it comes to non-UTF-8 input, that is, input that contains sequences of bytes that can't be decoded as UTF-8, you may find the behaviour differs between `C` and `C.UTF-8`. Still on a GNU system:

(where � is my terminal emulator's rendition of unknown things)
In C.UTF-8, `strcoll()` returns 0 on those two strings that don't form valid UTF-8 text, in effect reporting that they have the same sorting order.

In the C locale, any line made of a sequence of bytes other than 0 and not longer than `LINE_MAX` bytes is valid text. In C.UTF-8, there are further restrictions. That `a\200b` is not valid in UTF-8, so it's not text, so as per POSIX, the behaviour of `sort` on it is unspecified.

As a side note: on GNU systems, while `LC_ALL=C` takes precedence over `$LANGUAGE` for the language of the messages, `LC_ALL=C.UTF-8` doesn't.

¹ also note that the `C` locale charset doesn't have to be based on ASCII and that ASCII only covers values 0 to 127. `C` locales that use ASCII still consider bytes 128 to 255 as characters, albeit undefined characters. The `C` locale charset has to guarantee one byte per character though, so the `C` locale charset cannot be UTF-8.
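
Two of the claims in this answer (that UTF-8 byte order matches code point order, and that `a\200b` is not valid UTF-8) can be checked with a rough Python 3 sketch. The sample strings are made up for illustration, and comparing the encoded bytes only mimics what a `memcmp()`-style comparison would do; it does not call `sort` or `strcoll()`.

```python
# Illustrative sample covering 1-, 2-, 3- and 4-byte UTF-8 encodings
# (hypothetical data, chosen only for this sketch).
words = ["zebra", "éclair", "apple", "日本", "😀", "Zulu", "\u1234", "\u1233"]

# Python compares str objects code point by code point.
by_code_point = sorted(words)

# Comparing the UTF-8 encoded bytes is a byte-value comparison,
# similar in spirit to memcmp() with a length tie-break.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

same_order = by_code_point == by_utf8_bytes
print("byte order matches code point order:", same_order)

# a\200b: byte 0x80 (octal 200) is a continuation byte with no lead byte,
# so a strict UTF-8 decoder rejects the sequence.
raw = b"a\200b"
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print("a\\200b is valid UTF-8:", is_valid_utf8)
```

Because the two orderings agree on valid UTF-8, a byte-wise sort such as `LC_ALL=C sort` gives the code point order without any decoding step; the difference between the two locales only shows up on input like `raw` above.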