I want to sort two files, but I cannot get consistent results. It looks like a collation problem, but I cannot work out the cause. In the sample files the separator is a single space:
file1:
a
b
B
A
file2:
a 1
b 0
B 1
A 0
I use sort -k1,1 to sort these files, and the output is:
sorted1:
a
A
b
B
sorted2:
A 0
a 1
b 0
B 1
I need those sorted files for a join, and it is currently complaining that one of the files is not sorted.
In my environment LC_COLLATE and LC_ALL are not set; LANG is set to en_US.UTF-8.
With LC_ALL=C sort -k1,1 the output is:
sorted11:
A
B
a
b
sorted22:
A 0
B 1
a 1
b 0
I don't need a specific ordering; I just need the results to be joinable. Sorted this way, join works. To be safe I can also prefix the join command with LC_ALL=C.
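For reference, the workaround can be sketched end to end as below. This is a minimal sketch rather than a definitive recipe: it recreates the sample files in a scratch directory (the directory and the shell variable names are made up for illustration), sorts both with LC_ALL=C, uses sort -c to confirm that each result is ordered on field 1, and then joins them under the same locale.

```shell
# Scratch directory and variable names are illustrative, not from the original setup.
workdir=$(mktemp -d)
printf '%s\n' a b B A > "$workdir/file1"
printf '%s\n' 'a 1' 'b 0' 'B 1' 'A 0' > "$workdir/file2"

# Sort both files on field 1 using byte-wise (C locale) collation.
LC_ALL=C sort -k1,1 "$workdir/file1" > "$workdir/sorted11"
LC_ALL=C sort -k1,1 "$workdir/file2" > "$workdir/sorted22"

# sort -c exits 0 only if the input is already ordered on the given key.
LC_ALL=C sort -c -k1,1 "$workdir/sorted11" &&
    LC_ALL=C sort -c -k1,1 "$workdir/sorted22"
check_status=$?

# Run join under the same locale that was used for sorting.
joined=$(LC_ALL=C join "$workdir/sorted11" "$workdir/sorted22")
printf '%s\n' "$joined"
```

The point of the sketch is that sort, the order check, and join all see the same collation, so join has no reason to report unsorted input.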
My question
Why is a before A in sorted1, but after A in sorted2? Whatever the collation is, it is the same for both sort commands, and I am sorting on column 1, which is identical in both input files.
Added output of ltrace -e strcoll
file1
sort->strcoll("B","A") =1
sort->strcoll("a","b") =-1
sort->strcoll("a","A") =-7
a
sort->strcoll("b","A") =1
A
sort->strcoll("b","B") =-7
b
B
+++ exited (status 0) +++
file2
sort->strcoll("B 1","A 0") =1
sort->strcoll("a 1","b 0") =-1
sort->strcoll("a 1","A 0") =1
A 0
sort->strcoll("a 1","B 1") =-1
a 1
sort->strcoll("b 0","B 1") =-1
b 0
B 1
+++ exited (status 0) +++
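In the file2 trace above, strcoll is handed whole lines such as "a 1" and "A 0" rather than just the first field. One way to test whether a given sort orders the key column consistently across both files is sketched below. It is only a sketch under stated assumptions: same_key_order is a hypothetical helper, not part of coreutils, and the scratch directory and variable names are illustrative.

```shell
# Hypothetical helper: succeeds only if both files, sorted on field 1 in the
# given locale, produce the same sequence of first-field keys.
same_key_order() {
    locale_name=$1
    file_a=$2
    file_b=$3
    keys_a=$(LC_ALL=$locale_name sort -k1,1 "$file_a" | cut -d' ' -f1)
    keys_b=$(LC_ALL=$locale_name sort -k1,1 "$file_b" | cut -d' ' -f1)
    [ "$keys_a" = "$keys_b" ]
}

# Recreate the sample inputs in a scratch directory (illustrative paths).
workdir=$(mktemp -d)
printf '%s\n' a b B A > "$workdir/file1"
printf '%s\n' 'a 1' 'b 0' 'B 1' 'A 0' > "$workdir/file2"

if same_key_order C "$workdir/file1" "$workdir/file2"; then
    key_check=consistent
else
    key_check=inconsistent
fi
printf 'C locale: %s\n' "$key_check"
```

Passing en_US.UTF-8 instead of C as the first argument runs the same comparison under the locale from the question, provided that locale is installed. On a system that behaves like the one described here, that call would presumably report an inconsistency.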
Best Answer
As Stéphane Chazelas said in the comment, it is a bug in the specific implementation of coreutils (in coreutils-8.22-11.el7) by CentOS/Red Hat, more specifically in the buggy internationalisation patch (coreutils-i18n.patch) they wrote and applied on top of GNU's coreutils-8.22.
I reported it to CentOS and also to Red Hat. It was already known at Red Hat and fixed there in coreutils-8.22-13.el7. That one is not available yet for CentOS at this time (2015-08-20).
For completeness, note that the bug was also reported upstream (to GNU). That report was misdirected, since the bug was not in GNU's code, but you will find some more information about it there.
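To see which build a given machine has, one can query the package manager. The following is a minimal sketch, assuming an RPM-based system such as CentOS or RHEL; the variable name is illustrative. A result starting with coreutils-8.22-11.el7 is the affected build named above, while coreutils-8.22-13.el7 is the build in which Red Hat fixed it.

```shell
# Assumption: an RPM-based system; elsewhere fall back to a placeholder string.
if command -v rpm >/dev/null 2>&1 && rpm -q coreutils >/dev/null 2>&1; then
    coreutils_pkg=$(rpm -q coreutils)
else
    coreutils_pkg="unknown (rpm not available or coreutils not installed via rpm)"
fi
printf 'installed coreutils package: %s\n' "$coreutils_pkg"
```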