Here is a very simple (though artificial) example to illustrate the problem. Suppose that I have a file /tmp/table
, with the following contents:
px9xc
px12xc
pqx12xc
pqx9xc
Here, the x
character is meant to be a field separator. Hence, the file contains a table of 4 rows and 3 columns. (Henceforth, I'll refer to the columns as "fields".)
I want to sort this table lexicographically with respect to the fields. By this I mean that if two rows have identical values in field 1, then the tie is broken by sorting according to field 2, and if the values in field 2 are also identical, then sort according to field 3.1 NB: within each field, we assume sort
's default ordering.
This means that, for the table in /tmp/table
, the desired ordering is
px12xc
px9xc
pqx12xc
pqx9xc
(Note that in sort
's default ordering, 12
comes before 9
.)
A simple invocation of sort
won't produce the desired ordering, because the field separators will not be interpreted as such:
% sort /tmp/table
pqx12xc
pqx9xc
px12xc
px9xc
This also fails to produce the desired ordering:
% sort -tx -k1,3 /tmp/table
pqx12xc
pqx9xc
px12xc
px9xc
The only way I've found to achieve the desired ordering using sort
(at least the one installed on my system, namely GNU's) is this:
% sort -tx -k1,1 -k2,2 -k3,3 /tmp/table
px12xc
px9xc
pqx12xc
pqx9xc
The problem with this solution (aside from the tedium of specifying as many -k?,?
options as there are fields) is that it doesn't generalize to tables with a different number of fields.
Is there a convenient way (either with sort
or otherwise) to apply the field-based lexicographic ordering to "all the fields"?
1More generally, if the table has N fields, to decide which of two rows comes first in the lexicographic order, one applies the following recursive rule, letting k range from 1 to N: if the two rows are identical in their values for fields 1 through k – 1, then break the tie according to the value in the k-th field.
Best Answer
We can cheat and replace the separator character with another (eg the NUL character) that would allow "native" sorting to work, then set it back.
eg
Now we may need to be careful of locale impact on ordering, so you may need to do
This only works as long as the data has no NULs in it!
How this works may be easier explained with a formatted table:
We need to compare two rows:
We set the separator to be NUL and we compare
ab<NUL>def
toabc<NUL>def
. The<NUL>
comes beforec
and so we've correctly sorted on the first field.Now let's say the first field matches and the second field is different
Now we compare
abc<NUL>def
toabc<NUL>ghi
. We've got a match through the first field and the separator and are now sorting on the second field.