I was trying to find the intersection of two plain data files, and found from a previous post that it can be done with
comm -12 <(sort test1.list) <(sort test2.list)
It seems to me that sort test1.list is meant to put test1.list in order. To understand how sort works, I ran it against the following file, test1.list, as sort test1.list > test2.list
100
-200
300
2
92
15
340
However, it turns out that test2.list is
100
15
2
-200
300
340
92
This re-ordered list leaves me quite confused about how sort works, and about how sort and comm work together.
Best Answer
Per the comm manual, "Before `comm' can be used, the input files must be sorted using the collating sequence specified by the `LC_COLLATE' locale." And the sort manual: "Unless otherwise specified, all comparisons use the character collating sequence specified by the `LC_COLLATE' locale."
Therefore, and a quick test confirms, the LC_COLLATE order comm expects is provided by sort's default order, dictionary sort.
sort can sort files in a variety of manners:
-d: Dictionary order - ignores anything but whitespace and alphanumerics.
-g: General numeric - alpha, then negative numbers, then positive.
-h: Human-readable - negative, alpha, positive. n < nk = nK < nM < nG
-n: Numeric - negative, alpha, positive. k, M, G, etc. are not special.
-V: Version - positive, caps, lower, negative. 1 < 1.2 < 1.10
-f: Case-insensitive.
-R: Random - shuffle the input.
-r: Reverse - usually used with one of d, g, h, n, V.
There are other options, of course, but these are the ones you're likely to see or need.
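To make the differences between these modes concrete, here is a small sketch that runs the data from the question through three of them. It assumes a POSIX shell and a sort that supports -d and -n (GNU coreutils does). LC_ALL=C is set only to make the result independent of your system locale; the variable names are just for illustration.

```shell
# Sample data from the question, one value per line.
data=$(printf '%s\n' 100 -200 300 2 92 15 340)

# Byte-wise order (C locale): lines are compared character by character,
# and '-' sorts before the digits.
bytewise=$(printf '%s\n' "$data" | LC_ALL=C sort | paste -s -d ' ' -)

# Numeric order: each line is compared by its numeric value.
numeric=$(printf '%s\n' "$data" | LC_ALL=C sort -n | paste -s -d ' ' -)

# Dictionary order: only blanks and alphanumerics take part in the comparison,
# so the leading '-' is skipped.
dictionary=$(printf '%s\n' "$data" | LC_ALL=C sort -d | paste -s -d ' ' -)

printf 'bytewise:   %s\n' "$bytewise"
printf 'numeric:    %s\n' "$numeric"
printf 'dictionary: %s\n' "$dictionary"
```

Comparing the dictionary line with the test2.list shown in the question is a quick way to check which ordering your own sort used by default.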
Your test shows that the default sort order is probably -d, dictionary order.
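As for how sort and comm work together: since both manuals refer to the same LC_COLLATE setting, the safe pattern is to sort both inputs and run comm under one locale. The sketch below does that with temporary files so it runs in any POSIX shell; in bash, the process-substitution form from the question is equivalent. The contents of the second file and the scratch file names are made up for the example.

```shell
# Work in a scratch directory so no existing files are touched.
tmp=$(mktemp -d)
printf '%s\n' 100 -200 300 2 92 15 340 > "$tmp/test1.list"
printf '%s\n' 15 7 340 -200 41 > "$tmp/test2.list"   # hypothetical second file

# Sort both files under one fixed locale...
LC_ALL=C sort "$tmp/test1.list" > "$tmp/a.sorted"
LC_ALL=C sort "$tmp/test2.list" > "$tmp/b.sorted"

# ...and compare under the same locale. -1 and -2 suppress the lines unique
# to each file, leaving only the lines common to both.
common=$(LC_ALL=C comm -12 "$tmp/a.sorted" "$tmp/b.sorted")
printf '%s\n' "$common"

rm -r "$tmp"
```

Note that comm does not need the files in numeric order. It only needs both inputs in the same collation order that it uses itself, which is why a plain sort is enough here even though the result looks odd for numbers.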