LC_COLLATE Sort Order – Specify Sort Order to Place Lowercase Before Uppercase


Given the file:

$ cat file
1
a
C
B
2
c
3
A
b

By default, sort produces:

$ sort file
1
2
3
a
A
b
B
c
C

With LC_COLLATE=C, sort places uppercase letters before lowercase:

$ LC_COLLATE=C sort file
1
2
3
A
B
C
a
b
c
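
For reference, the C locale result follows from byte values: digits (0x30 to 0x39) come before uppercase letters (0x41 to 0x5A), which come before lowercase letters (0x61 to 0x7A). A minimal Python sketch, not part of the original post, reproduces the same ordering with plain code-point comparison:

```python
# Contents of the example file, one entry per line.
lines = ["1", "a", "C", "B", "2", "c", "3", "A", "b"]

# Python compares str values by code point, which for ASCII matches
# the byte-wise comparison used by the C locale.
c_locale_order = sorted(lines)
print(c_locale_order)  # ['1', '2', '3', 'A', 'B', 'C', 'a', 'b', 'c']
```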

Is it possible to get sort to reverse the case ordering, that is, digits first, then lowercase, then uppercase?

Best Answer

I don't know of any locales that, by default, sort in that order. The solution is to create a custom locale with a customized sort order. If anyone, four years later, wants to sort in a custom fashion, here's the trick.

The vast majority of locales don't specify their own sort order, but rather copy the sort order defined in /usr/share/i18n/locales/iso14651_t1_common, so that is the file you will want to edit. Modifying the original iso14651_t1_common would change the sort order for nearly every locale, so I suggest you make a copy and edit that instead. Details about how the sort order works and how to create a custom locale in your $HOME directory without root access are found in this answer to a similar question.

Take a look at how a and A are ordered based on their entries in iso14651_t1_common:

<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
<U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A

b and B are similar:

<U0062> <b>;<BAS>;<MIN>;IGNORE # 233 b
<U0042> <b>;<BAS>;<CAP>;IGNORE # 550 B

We see that on the first pass, both a and A have the collating symbol <a>, while both b and B have the collating symbol <b>. Since <a> appears before <b> in iso14651_t1_common, a and A tie with each other and sort before b and B, which are likewise tied. The second pass doesn't break the ties because all four characters have the collating symbol <BAS>. During the third pass the ties are resolved, because the collating symbol for lowercase letters, <MIN>, appears on line 3467, before the collating symbol for uppercase letters, <CAP> (line 3488). So the sort order ends up as a, A, b, B.
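
The pass-by-pass reasoning can be sketched in a few lines of Python. This is a simplified model and not how glibc actually implements collation: the position of a symbol in the hypothetical SYMBOL_ORDER list stands in for where that symbol is defined in iso14651_t1_common, and the placement of <BAS> relative to the others is an assumption that does not affect the result here. It also only handles single characters, which is all the example file contains.

```python
# Simplified model of multi-level collation, not glibc's implementation.
# Earlier position in this list = collating symbol defined earlier in the file.
SYMBOL_ORDER = ["<BAS>", "<MIN>", "<CAP>", "<a>", "<b>"]
RANK = {sym: i for i, sym in enumerate(SYMBOL_ORDER)}

# The four entries quoted above, with the fourth-level IGNORE left out.
ENTRIES = {
    "a": ("<a>", "<BAS>", "<MIN>"),
    "A": ("<a>", "<BAS>", "<CAP>"),
    "b": ("<b>", "<BAS>", "<MIN>"),
    "B": ("<b>", "<BAS>", "<CAP>"),
}

def sort_key(ch):
    # Tuples compare element by element, so a later level is consulted
    # only when every earlier level is tied.
    return tuple(RANK[sym] for sym in ENTRIES[ch])

default_order = sorted("BbAa", key=sort_key)
print(default_order)  # ['a', 'A', 'b', 'B']
```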

Swapping the first and third collating symbols would sort letters first by case (lower then upper), then by accent (<BAS> means non-accented), then by alphabetical order. However, both <MIN> and <CAP> come before the numeric digits, so this would have the unwanted effect of putting digits after letters.
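
To see the unwanted effect, here is the same kind of simplified Python model with the first and third symbols swapped. The digit entry and its symbol <1> are hypothetical placeholders; the only thing taken from the file is that <MIN> and <CAP> are defined before the digits, and the digits before the letters.

```python
# Simplified model, not glibc's implementation. List position stands in
# for definition order in iso14651_t1_common; "<1>" and the position of
# <BAS> are assumptions made for illustration.
SYMBOL_ORDER = ["<BAS>", "<MIN>", "<CAP>", "<1>", "<a>", "<b>"]
RANK = {sym: i for i, sym in enumerate(SYMBOL_ORDER)}

SWAPPED = {
    "a": ("<MIN>", "<BAS>", "<a>"),
    "A": ("<CAP>", "<BAS>", "<a>"),
    "b": ("<MIN>", "<BAS>", "<b>"),
    "B": ("<CAP>", "<BAS>", "<b>"),
    # Hypothetical digit entry, left untouched by the swap.
    "1": ("<1>", "<BAS>", "<MIN>"),
}

def sort_key(ch):
    # Compare level by level; the first level is now the case symbol.
    return tuple(RANK[sym] for sym in SWAPPED[ch])

swapped_order = sorted("1BbAa", key=sort_key)
print(swapped_order)  # ['a', 'b', 'A', 'B', '1']
```

The letters now sort lowercase first, but the digit lands at the end because its first-level symbol is defined after <MIN> and <CAP>.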

The easiest way to keep digits first while making all lowercase letters come before all uppercase letters is to force all letters to tie during the first comparison by setting them all equal to <a>. To make sure that they sort alphabetically within case, change the last collating symbol from IGNORE to the current first collating symbol. Following this pattern, a would become:

<U0061> <a>;<BAS>;<MIN>;<a> # 198 a

A would become:

<U0041> <a>;<BAS>;<CAP>;<a> # 517 A

b would become:

<U0062> <a>;<BAS>;<MIN>;<b> # 233 b

B would become:

<U0042> <a>;<BAS>;<CAP>;<b> # 550 B

and so on for the rest of the letters.
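
As a sanity check, the modified entries can be run through a small Python model of the four comparison levels. This is a sketch under assumptions, not glibc's algorithm: list position stands in for definition order in the file, IGNORE is modelled as the lowest weight, and the digit entries (with symbols <1>, <2>, <3>) are hypothetical and assumed to be left unmodified.

```python
# Simplified model of the modified collation table, not glibc's implementation.
SYMBOL_ORDER = ["IGNORE", "<BAS>", "<MIN>", "<CAP>",
                "<1>", "<2>", "<3>", "<a>", "<b>", "<c>"]
RANK = {sym: i for i, sym in enumerate(SYMBOL_ORDER)}

def letter_entry(ch):
    # Modified pattern: every letter gets <a> on the first level, and the
    # letter's original first-level symbol moves to the fourth level.
    case = "<MIN>" if ch.islower() else "<CAP>"
    return ("<a>", "<BAS>", case, f"<{ch.lower()}>")

def digit_entry(ch):
    # Hypothetical unmodified digit entry.
    return (f"<{ch}>", "<BAS>", "<MIN>", "IGNORE")

def sort_key(ch):
    entry = digit_entry(ch) if ch.isdigit() else letter_entry(ch)
    return tuple(RANK[sym] for sym in entry)

# Same contents as the example file in the question.
custom_order = sorted(["1", "a", "C", "B", "2", "c", "3", "A", "b"], key=sort_key)
print(custom_order)  # ['1', '2', '3', 'a', 'b', 'c', 'A', 'B', 'C']
```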

Once you have created a customized version of iso14651_t1_common, follow the instructions in the answer linked above to compile your custom locale.
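
Editing 52 entries by hand is tedious, so the per-letter change could be scripted. The following Python sketch is only a starting point: rewrite is a hypothetical helper, and the regular expression assumes the entries are laid out exactly like the four lines quoted above, which may not hold for every version of the file. Run it on your copy, never on the original, and review the result before compiling.

```python
import re

# Matches entries shaped like: <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
ENTRY = re.compile(r"^<U([0-9A-F]{4})>(\s+)<([a-z])>;<BAS>;<(MIN|CAP)>;IGNORE")

def rewrite(line):
    """Return the modified entry for a plain letter, or the line unchanged."""
    m = ENTRY.match(line)
    if m is None:
        return line
    code, gap, letter, case = m.groups()
    # Only touch the letter itself (a or A for <a>), not other characters
    # that happen to share the same first-level symbol.
    if chr(int(code, 16)).lower() != letter:
        return line
    return f"<U{code}>{gap}<a>;<BAS>;<{case}>;<{letter}>" + line[m.end():]

sample = [
    "<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a",
    "<U0042> <b>;<BAS>;<CAP>;IGNORE # 550 B",
]
for line in sample:
    print(rewrite(line))
```

To apply it to a whole file, read your copy of iso14651_t1_common line by line, pass each line through rewrite, and write the results to a new file.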
