Sort – Why Does ‘sort’ Produce Output in Weird Order?

bash coreutils sort

Consider the following input to sort:

cat > foo <<EOM
D,,5014978
DD,,25
D,I,1972765530
D,Y,4223624
-,Y,71285059
YA,I,2
EOM

Now try running sort foo

The output is not sorted when I try this on any of my Linux boxes (GNU coreutils versions 6.9-7.4). The output is sorted when run under Cygwin (GNU coreutils 8.5). Comments?

Best Answer

Sorting depends on the locale; specifically, it depends on $LC_COLLATE, which $LC_ALL overrides when it is set and which falls back to $LANG when it is unset. The locale command shows the values currently in effect. See man 3 strcoll, man 3 setlocale, etc.
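The precedence described above can be sketched as a small shell function. The name effective_collate is made up for this illustration; it only mirrors the lookup order (LC_ALL, then LC_COLLATE, then LANG, then the POSIX default) and does not check that the named locale is actually installed.

```shell
# Print the locale name that governs collation, following the precedence
# LC_ALL > LC_COLLATE > LANG > POSIX default.
# An empty value is treated the same as an unset one.
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-POSIX}}}"
}

# Show the value for the current shell environment.
effective_collate
```

For the authoritative view, compare the result with the LC_COLLATE line printed by the locale command.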

LC_COLLATE=C (or POSIX or no locale at all) results in a strict byte-by-byte comparison.

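LC_COLLATE=en_US.utf8 results in an alphabetical-equivalence sort, with punctuation ignored and characters within the same equivalence class treated equally. That is why the sample input looks unsorted: with the commas and the hyphen ignored, the lines compare roughly as D5014978, DD25, DI1972765530, DY4223624, Y71285059, YAI2, which is already in order.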
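If plain byte order is what you want, one option is to force the C locale for the sort command only, leaving the rest of the session untouched. A minimal sketch, using the sample input from the question (the helper name sort_bytes and the temporary file are only for illustration):

```shell
# Recreate the sample input from the question in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOM'
D,,5014978
DD,,25
D,I,1972765530
D,Y,4223624
-,Y,71285059
YA,I,2
EOM

# Run sort with byte-by-byte comparison, for this one command only.
sort_bytes() {
    LC_ALL=C sort "$@"
}

sort_bytes "$input"
```

In byte order the line starting with - (0x2D) comes first, and DD,,25 lands after all the D, lines because , (0x2C) sorts before D (0x44).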

Related Solutions

Where has the `uniq` or `sort -u` line gone, with some unicode characters

Short version: collation doesn't really work in command line utilities.

Longer version: the underlying function to compare two strings is strcoll. The description isn't very helpful, but the conceptual method of operation is to convert both strings to a canonical form, and then compare the two canonical forms. The function strxfrm constructs this canonical form.

Let's observe the canonical forms of a few strings (with GNU libc, under Debian squeeze):

$ export LC_ALL=en_US.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' b a A à 〼 〇
b d010801020
a c010801020
A c010801090
à 101010102c6b
〼 101010102c6b102c6b102c6b
〇 101010102c6b102c6b102c6b

As you can see, 〼 and 〇 have the same canonical form. I think that's because these characters are not mentioned in the collation tables of the en_US.UTF-8 locale. They are, however, present in a Japanese locale.

$ export LC_ALL=ja_JP.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' 〼 〇 
〼 303030
〇 3c9b

The source code for the locale data (in Debian squeeze) is in /usr/share/i18n/locales/en_US, which includes /usr/share/i18n/locales/iso14651_t1_common. This file doesn't have an entry for U3007 or U303C, nor are they included in any range that I can find.

I'm not familiar with the rules to build the collation order, but from what I understand, the relevant phrasing is

The symbol UNDEFINED shall be interpreted as including all coded character set values not specified explicitly or via the ellipsis symbol. (…) If no UNDEFINED symbol is specified, and the current coded character set contains characters not specified in this section, the utility shall issue a warning message and place such characters at the end of the character collation order.

It looks like glibc is instead ignoring characters that aren't specified. I don't know whether there's a flaw in my understanding of the POSIX spec, whether I missed something in glibc's locale definition, or whether there's a bug in the glibc locale compiler.
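A practical workaround, assuming byte-wise uniqueness is acceptable for your data, is to run the deduplication in the C locale, where two lines compare equal only if their bytes are identical. The helper name unique_bytes is made up for this sketch:

```shell
# Deduplicate lines by raw bytes, so characters that the locale's collation
# tables treat as equal are still kept apart.
unique_bytes() {
    LC_ALL=C sort -u "$@"
}

# Three input lines, two distinct characters: both characters survive.
printf '%s\n' '〼' '〇' '〼' | unique_bytes
```

The trade-off is that the resulting order is byte order rather than the locale's alphabetical order.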

Sorting with Perl respecting Locale Settings

The problem is local $/ = undef. It puts perl in slurp mode, so <> returns the entire file as a single string. The list passed to sort then contains only one element, and there is nothing to sort. The output should therefore be identical to your input data. I also use Ubuntu 12.04 LTS with perl 5.14.2:

$ perl -le 'local $/ = undef;print ++$i for <>' < cat
1

$ perl -le 'print ++$i for <>' < cat
1
2
3
4
5
6
7
8
9

If you remove local $/ = undef, perl's sort will produce the same output as the shell's sort with LC_ALL=C:

$ perl -e 'print sort <>' < data
Uber
peach
péché
pêche
sin
war
wird
wär
Über

Note

Without use locale, perl ignores your current locale settings. With use locale, the Perl comparison operators ("lt", "le", "cmp", "ge", and "gt") use LC_COLLATE (when LC_ALL is absent), and sort is also affected because it uses cmp by default.

You can get the current LC_COLLATE value:

$ perl -MPOSIX=setlocale -le 'print setlocale(LC_COLLATE)'
en_US.UTF-8
