GNU utility: sort

I have an issue sorting a file based on the first two columns.

The layout of the file is:

1 998688068 PizzaFan Insurance 22.47
5 072821325 Plaisio Computers 26.35
4 998688068 PizzaFan Food 27.32
5 456834578 G.Yannopoulos Medical 91.67

….
….

I used this command:
sort -n -k 1,2 "$fpath" -o "$fpath.ordered"

The sort result is:

1 473151252 Goodys Food 7.15
1 951515524 Atlantic SuperMarket 41.32
1 998688068 Atlantic SuperMarket 80.23
1 998688068 PizzaFan Food 61.72
1 998688068 PizzaFan Insurance 22.47
2 094321587 Vasilopoulos SuperMarket 6.50

….
….

I don't understand why all the columns get sorted (see the 3rd column and the PizzaFan Insurance row).

I thought -k 1,2 meant "sort on column 1 and resolve ties with column 2", but it doesn't seem to work that way.

The result is the same as with:
sort -n "$fpath" -o "$fpath.ordered"

Best Answer

If you want a stable sort (the relative order of the input rows is preserved when keys tie), you need to use the -s or --stable flag.
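As a minimal sketch of how the flag combines with per-column keys: -k 1,2 defines a single key running from field 1 through field 2, and with -n only the leading number of that key is compared, so ties fall through to a last-resort comparison of the whole line. The example below instead gives each column its own numeric key and adds -s. The sample rows are borrowed from the question, and the variable name stable_sorted is only illustrative.

```shell
# Sample rows borrowed from the question (hypothetical input order).
stable_sorted=$(printf '%s\n' \
  '1 998688068 PizzaFan Insurance 22.47' \
  '5 072821325 Plaisio Computers 26.35' \
  '1 998688068 PizzaFan Food 61.72' \
  '1 473151252 Goodys Food 7.15' |
  sort -s -k 1,1n -k 2,2n)
# -k 1,1n: numeric key on field 1 only.
# -k 2,2n: numeric key on field 2 only, used when field 1 ties.
# -s: no last-resort whole-line comparison, so rows that tie on
#     both keys keep their input order.
printf '%s\n' "$stable_sorted"
```

With this input, the two 1 998688068 rows should stay in their original order (Insurance before Food) instead of being reordered by the remaining columns.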

Related Solutions

File Sorting – How to Sort Based on the Third Column

sort -k 3,3 myFile

would display the file sorted by the 3rd column, assuming the columns are separated by sequences of blanks (ASCII SPC and TAB characters in the POSIX/C locale), according to the sort order defined by the current locale.

Note that the leading blanks are included in the column (the default separator is the transition from a non-blank to a blank). That can make a difference in locales where spaces are not ignored for the purpose of comparison; use the -b option to ignore the leading blanks.
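A small sketch of that effect, assuming a hypothetical two-line input and forcing the C locale so that blanks are compared byte by byte:

```shell
# Hypothetical input: the second field has two leading blanks on the
# first line and one leading blank on the second.
input=$(printf '%s\n' 'a  c' 'a b')

# Without -b the blanks belong to the key: "  c" sorts before " b"
# in the C locale because space has a lower byte value than "b".
with_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -k 2,2)

# With -b the leading blanks are skipped, so the keys are "c" and "b".
without_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -b -k 2,2)

printf '%s\n---\n%s\n' "$with_blanks" "$without_blanks"
```

The first result keeps the input order, while the second puts a b first.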

Note that this is completely independent of the shell (all shells would parse that command line the same way, and shells generally don't have the sort command built in).

-k 3 sorts on the portion of each line starting with the 3rd column (including the leading blanks). In the C locale, because the space and tab characters rank before all the printable characters, that will generally give you the same result as -k 3,3 (except for lines that have an identical third field).

-u retains only one of the lines when several sort identically, that is, when their sort keys sort the same (which is not necessarily the same as the keys being equal).

cat is the command to concatenate. You don't need it here.

If the columns are separated by something else, you need the -t option to specify the separator.
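For instance, with comma-separated columns, a sketch might look like this (the input rows and the variable name csv_sorted are made up for illustration, and the C locale is forced to keep the ordering predictable):

```shell
# Hypothetical comma-separated input; sort on the third field only.
csv_sorted=$(printf '%s\n' 'x,1,pear' 'y,2,apple' 'z,3,fig' |
  LC_ALL=C sort -t , -k 3,3)
printf '%s\n' "$csv_sorted"
```

With -t, fields are delimited by each occurrence of the separator, so there are no leading blanks to worry about in the key.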

Given the example file a:

$ cat a
a c c c
a b ca d
a b  c e
a b c d

With -u -k 3:

$ echo $LANG
en_GB.UTF-8

$ sort -u -k 3 a
a b ca d
a c c c
a b c d
a b  c e

Lines 2 and 3 of the output have the same third column, but here the sort key runs from the third column to the end of the line, so -u retains both. ␠ca␠d sorts before ␠c␠c because spaces are ignored in the first pass in my locale, where cad sorts before cc.

$ sort -u -k 3,3 a
a b c d
a b  c e
a b ca d

Above, only one line is retained among those whose 3rd column is ␠c. Note how the one with ␠␠c (2 leading spaces) is retained.

$ sort -k 3 a
a b ca d
a c c c
a b c d
a b  c e
$ sort -k 3,3 a
a b c d
a c c c
a b  c e
a b ca d

See how the order of a b c d and a c c c is reversed. In the first case, that is because ␠c␠c sorts before ␠c␠d. In the second case, the sort key is the same (␠c), so the last-resort comparison, which compares the lines in full, puts a b c d before a c c c.

$ sort -b -k 3,3 a
a b c d
a b  c e
a c c c
a b ca d

Once we ignore the blanks, the sort key for the first 3 lines is the same (c), so they are sorted by the last-resort comparison.

$ LC_ALL=C sort -k 3 a
a b  c e
a c c c
a b c d
a b ca d
$ LC_ALL=C sort -k 3,3 a
a b  c e
a b c d
a c c c
a b ca d

In the C locale, ␠␠c sorts before ␠c as there is only one pass there where characters (then single bytes) sort based on their code point value (where space has a lower code point than c).

Centos – Sort command inconsistent behavior

As Stéphane Chazelas said in the comment, it is a bug in the specific implementation of coreutils (in coreutils-8.22-11.el7) by CentOS/Red Hat, more specifically in the buggy internationalisation patch (coreutils-i18n.patch) they wrote and applied on top of GNU's coreutils-8.22.

I reported it to CentOS and also to Red Hat. It was already known at Red Hat and fixed there in coreutils-8.22-13.el7.

That one is not available yet for CentOS at this time (2015-08-20).

For completeness, note that the bug was also reported upstream (at GNU's), incorrectly since the bug was not there; you'll find some more information about it in that report.
