Sorting lexicographically by “all fields”

sort

Here is a very simple (though artificial) example to illustrate the problem. Suppose that I have a file /tmp/table, with the following contents:

px9xc
px12xc
pqx12xc
pqx9xc

Here, the x character is meant to be a field separator. Hence, the file contains a table of 4 rows and 3 columns. (Henceforth, I'll refer to the columns as "fields".)

I want to sort this table lexicographically with respect to the fields. By this I mean that if two rows have identical values in field 1, then the tie is broken by sorting according to field 2, and if the values in field 2 are also identical, then sort according to field 3.1 NB: within each field, we assume sort's default ordering.

This means that, for the table in /tmp/table, the desired ordering is

px12xc
px9xc
pqx12xc
pqx9xc

(Note that in sort's default ordering, 12 comes before 9.)

A simple invocation of sort won't produce the desired ordering, because the field separators will not be interpreted as such:

% sort /tmp/table
pqx12xc
pqx9xc
px12xc
px9xc

This also fails to produce the desired ordering:

% sort -tx -k1,3 /tmp/table
pqx12xc
pqx9xc
px12xc
px9xc

The only way I've found to achieve the desired ordering using sort (at least the one installed on my system, namely GNU's) is this:

% sort -tx -k1,1 -k2,2 -k3,3 /tmp/table
px12xc
px9xc
pqx12xc
pqx9xc

The problem with this solution (aside from the tedium of specifying as many -k?,? options as there are fields) is that it doesn't generalize to tables with a different number of fields.

Is there a convenient way (either with sort or otherwise) to apply the field-based lexicographic ordering to "all the fields"?


1More generally, if the table has N fields, to decide which of two rows comes first in the lexicographic order, one applies the following recursive rule, letting k range from 1 to N: if the two rows are identical in their values for fields 1 through k – 1, then break the tie according to the value in the k-th field.

Best Answer

We can cheat and replace the separator character with another (eg the NUL character) that would allow "native" sorting to work, then set it back.

eg

tr x '\000' < file.txt | sort | tr '\000' x

Now we may need to be careful of locale impact on ordering, so you may need to do

tr x '\000' < file.txt | LANG=C sort | tr '\000' x

This only works as long as the data has no NULs in it!

How this works may be easier explained with a formatted table:

We need to compare two rows:

ab  | def
abc | def

We set the separator to be NUL and we compare ab<NUL>def to abc<NUL>def. The <NUL> comes before c and so we've correctly sorted on the first field.

Now let's say the first field matches and the second field is different

abc | def
abc | ghi

Now we compare abc<NUL>def to abc<NUL>ghi. We've got a match through the first field and the separator and are now sorting on the second field.

Related Question