Linux – Issues with Using sort and comm

comm linux shell sort

I was trying to find the intersection of two plain data files, and found from a previous post that it can be done through

comm -12 <(sort test1.list) <(sort test2.list)

As I understand it, sort test1.list sorts the contents of test1.list. To see how sort works, I ran sort test1.list > test2.list on the following test1.list:

100
-200
300
2
92
15
340

However, it turns out that test2.list is

100
15
2
-200
300
340
92

This re-ordered list leaves me quite confused about how sort works, and how sort and comm work together.

Best Answer

Per the comm manual, "Before `comm' can be used, the input files must be sorted using the collating sequence specified by the `LC_COLLATE' locale."

And the sort manual: "Unless otherwise specified, all comparisons use the character collating sequence specified by the `LC_COLLATE' locale."

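Therefore, as a quick test confirms, the LC_COLLATE order that comm expects is exactly what sort produces by default, provided both commands run under the same locale. In many locales that default order resembles a dictionary sort.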
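A minimal sketch of sort and comm working together, assuming two small sample files invented for this illustration (a.list and b.list are hypothetical names). It pins LC_ALL=C so both commands use the same collating order, and it uses temporary files instead of process substitution so it also runs in a plain POSIX shell:

```shell
# Hypothetical sample data, created only for this illustration.
tmpdir=$(mktemp -d)
printf '%s\n' 100 -200 300 2 92 > "$tmpdir/a.list"
printf '%s\n' 92 15 -200 340 > "$tmpdir/b.list"

# Pin the locale so that sort and comm agree on the collating order.
export LC_ALL=C

# Sort each input with the default (locale collating) order.
sort "$tmpdir/a.list" > "$tmpdir/a.sorted"
sort "$tmpdir/b.list" > "$tmpdir/b.sorted"

# -1 hides lines unique to the first file, -2 hides lines unique to
# the second file, so only the lines common to both remain.
common=$(comm -12 "$tmpdir/a.sorted" "$tmpdir/b.sorted")
printf '%s\n' "$common"

rm -rf "$tmpdir"
```

With this sample data the common lines are -200 and 92. In bash, the same thing can be written in one line with process substitution, as in the question.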

sort can sort files in a variety of manners:

  • -d: Dictionary order - ignores anything but whitespace and alphanumerics.
  • -g: General numeric - alpha, then negative numbers, then positive.
  • -h: Human-readable - negative, alpha, positive. n < nk = nK < nM < nG
  • -n: Numeric - negative, alpha, positive. k,M,G, etc. are not special.
  • -V: Version - positive, caps, lower, negative. 1 < 1.2 < 1.10
  • -f: Case-insensitive.
  • -R: Random - shuffle the input.
  • -r: Reverse - usually used with one of dghnV

There are other options, of course, but these are the ones you're likely to see or need.
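As a small illustration of the options above, here is the data from the question sorted with -n and with -n combined with -r. The variable names are only for this sketch:

```shell
# The seven values from the question's test1.list.
numbers=$(printf '%s\n' 100 -200 300 2 92 15 340)

# -n compares by numeric value, so -200 comes first and 340 last.
numeric=$(printf '%s\n' "$numbers" | sort -n)

# Adding -r reverses the numeric order.
reversed=$(printf '%s\n' "$numbers" | sort -nr)

printf 'numeric:  %s\n' "$(printf '%s' "$numeric" | tr '\n' ' ')"
printf 'reversed: %s\n' "$(printf '%s' "$reversed" | tr '\n' ' ')"
```

Note that numerically sorted output is generally not in the collating order comm expects, so for comm the inputs should be sorted with the default order instead.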

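Your test shows that the default sort order behaves much like -d, dictionary order: the leading "-" of -200 is effectively ignored, so -200 lands between 2 and 300. Strictly speaking, the default is the collating order of your locale, and in many locales (en_US.UTF-8, for example) punctuation carries little weight. Under LC_ALL=C, sort compares raw bytes instead and puts -200 first.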
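One way to check how much of this ordering comes from the locale is to pin the locale to C and compare the default order with -d on the question's data. This is a sketch; the variable names are made up for the example, and the result of a plain sort without LC_ALL=C depends on the locale configured on your system:

```shell
# The seven values from the question's test1.list.
numbers=$(printf '%s\n' 100 -200 300 2 92 15 340)

# Default order in the C locale: plain byte comparison.
# "-" (0x2D) sorts before every digit, so -200 comes first.
byte_order=$(printf '%s\n' "$numbers" | LC_ALL=C sort)

# Dictionary order in the C locale: "-" is skipped during comparison,
# so -200 is compared as if it were 200.
dict_order=$(printf '%s\n' "$numbers" | LC_ALL=C sort -d)

printf 'C default: %s\n' "$(printf '%s' "$byte_order" | tr '\n' ' ')"
printf 'C with -d: %s\n' "$(printf '%s' "$dict_order" | tr '\n' ' ')"
```

The -d result matches the test2.list shown in the question, while the byte-order result moves -200 to the top.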

  d   |   g   |   h   |   n   |   V 
------+-------+-------+-------+-------
  1   |  a    | -1G   | -10   |  1
 -1   |  A    | -1k   | -5    |  1G
  10  |  z    | -10   | -1    |  1g
 -10  |  Z    | -5    | -1g   |  1k
  1.10| -10   | -1    | -1G   |  1.2
  1.2 | -5    | -1g   | -1k   |  1.10
  1g  | -1    |  a    |  a    |  5
  1G  | -1g   |  A    |  A    |  10
 -1g  | -1G   |  z    |  z    |  A
 -1G  | -1k   |  Z    |  Z    |  Z
  1k  |  1    |  1    |  1    |  a
 -1k  |  1g   |  1g   |  1g   |  z
  5   |  1G   |  1.10 |  1G   | -1
 -5   |  1k   |  1.2  |  1k   | -1G
  a   |  1.10 |  5    |  1.10 | -1g
  A   |  1.2  |  10   |  1.2  | -1k
  z   |  5    |  1k   |  5    | -5
  Z   |  10   |  1G   |  10   | -10