CentOS – Sort command inconsistent behavior

centos locale sort

I want to sort two files but I cannot get consistent results. It looks like a collation problem, but I cannot work out the cause. In the sample files the separator is a single space:

file1:

a
b
B
A

file2:

a 1
b 0
B 1
A 0

I use sort -k1,1 to sort these files and the output is:

sorted1:

a
A
b
B

sorted2:

A 0
a 1
b 0
B 1

I need those sorted files for a join, and it's currently complaining that one of the files is not sorted.
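One way to see which file join would object to is to ask sort whether the file is already in order on the key: sort -c exits non-zero at the first out-of-order line. The following is only a minimal sketch, assuming a POSIX sort; check_sorted and the file name bytewise are hypothetical names made up for this illustration, and the demo uses the C locale because its byte order is the same on every system.

```shell
#!/bin/sh
# check_sorted is a hypothetical helper used only for this illustration.
# It reports whether a file is ordered on field 1 under the given locale.
check_sorted() {
    # $1 = locale name, $2 = file to check
    if LC_ALL=$1 sort -c -k1,1 "$2" 2>/dev/null; then
        echo "sorted"
    else
        echo "not sorted"
    fi
}

workdir=$(mktemp -d)   # scratch directory, assumed writable

# sorted2 exactly as shown above, plus a byte-ordered variant for comparison.
printf '%s\n' 'A 0' 'a 1' 'b 0' 'B 1' > "$workdir/sorted2"
printf '%s\n' 'A 0' 'B 1' 'a 1' 'b 0' > "$workdir/bytewise"

mixed_in_c=$(check_sorted C "$workdir/sorted2")
bytes_in_c=$(check_sorted C "$workdir/bytewise")
echo "sorted2 under C:  $mixed_in_c"
echo "bytewise under C: $bytes_in_c"

rm -rf "$workdir"
```

Passing the locale that join will actually run under (for example en_US.UTF-8 instead of C) checks the files against the order join expects.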

In my environment LC_COLLATE and LC_ALL are not set, and LANG is set to en_US.UTF-8.
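For reference, the collation locale that actually applies follows the usual precedence: LC_ALL overrides LC_COLLATE, which overrides LANG. A small sketch of how that could be inspected, assuming the standard locale utility is installed; effective_collate is a hypothetical helper name, and the first output line depends entirely on the local environment.

```shell
#!/bin/sh
# effective_collate is a hypothetical helper used only for this illustration.
# It prints the LC_COLLATE value that sort and join would pick up.
effective_collate() {
    locale | grep '^LC_COLLATE='
}

# With the current environment (output depends on your own settings):
effective_collate

# With LC_ALL forced to C for a single command, overriding LANG and LC_COLLATE:
forced=$(LC_ALL=C locale | grep '^LC_COLLATE=')
printf '%s\n' "$forced"
```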

With LC_ALL=C sort -k1,1 the output is:

sorted11:

A
B
a
b

sorted22:

A 0
B 1
a 1
b 0

I don't need a specific ordering; I just want to be able to join the results. Sorted this way, join works. To be safe I can also prefix the join command with LC_ALL=C.
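The workaround just described can be sketched end to end as follows. This is only an illustration under stated assumptions: the scratch directory is hypothetical, the file names mirror the ones used above, and both sort and join run under LC_ALL=C so that they agree on the same byte-wise order.

```shell
#!/bin/sh
# Hypothetical scratch directory, used only for this illustration.
workdir=$(mktemp -d)
printf '%s\n' a b B A > "$workdir/file1"
printf '%s\n' 'a 1' 'b 0' 'B 1' 'A 0' > "$workdir/file2"

# Sort both inputs on field 1 using byte-wise (C locale) comparison.
LC_ALL=C sort -k1,1 "$workdir/file1" > "$workdir/sorted11"
LC_ALL=C sort -k1,1 "$workdir/file2" > "$workdir/sorted22"

# Run join under the same locale so it expects the same order.
joined=$(LC_ALL=C join "$workdir/sorted11" "$workdir/sorted22")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

Setting the locale per command, rather than exporting it, keeps the rest of the session on en_US.UTF-8.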

My question

Why is a before A in sorted1 but after A in sorted2? Whatever the collation is, it is the same for both sort commands, and I am sorting on column 1, which is identical in both input files.

Added output of ltrace -e strcoll

file1

sort->strcoll("B","A") =1
sort->strcoll("a","b") =-1 
sort->strcoll("a","A") =-7
a
sort->strcoll("b","A") =1
A
sort->strcoll("b","B") =-7
b
B
+++ exited (status 0) +++

file2

sort->strcoll("B 1","A 0") =1
sort->strcoll("a 1","b 0") =-1 
sort->strcoll("a 1","A 0") =1
A 0
sort->strcoll("a 1","B 1") =-1
a 1
sort->strcoll("b 0","B 1") =-1
b 0
B 1
+++ exited (status 0) +++

Best Answer

As Stéphane Chazelas said in a comment, this is a bug in the CentOS/Red Hat build of coreutils (coreutils-8.22-11.el7), specifically in the internationalisation patch (coreutils-i18n.patch) that they wrote and applied on top of GNU's coreutils-8.22.

I reported it to CentOS and also to Red Hat. It was already known at Red Hat and has been fixed there in coreutils-8.22-13.el7.

That version is not yet available for CentOS as of 2015-08-20.

For completeness, note that the bug was also reported upstream to GNU (incorrectly, since the bug is not in GNU's code), where you'll find some more information about it.
