Sorting lexicographically by “all fields”

sort

Here is a very simple (though artificial) example to illustrate the problem. Suppose that I have a file /tmp/table, with the following contents:

px9xc
px12xc
pqx12xc
pqx9xc

Here, the x character is meant to be a field separator. Hence, the file contains a table of 4 rows and 3 columns. (Henceforth, I'll refer to the columns as "fields".)

I want to sort this table lexicographically with respect to the fields. By this I mean that if two rows have identical values in field 1, then the tie is broken by sorting according to field 2, and if the values in field 2 are also identical, then sort according to field 3.¹ NB: within each field, we assume sort's default ordering.

This means that, for the table in /tmp/table, the desired ordering is

px12xc
px9xc
pqx12xc
pqx9xc

(Note that in sort's default ordering, 12 comes before 9.)

A simple invocation of sort won't produce the desired ordering, because the field separators will not be interpreted as such:

% sort /tmp/table
pqx12xc
pqx9xc
px12xc
px9xc

This also fails to produce the desired ordering:

% sort -tx -k1,3 /tmp/table
pqx12xc
pqx9xc
px12xc
px9xc

The only way I've found to achieve the desired ordering using sort (at least the one installed on my system, namely GNU's) is this:

% sort -tx -k1,1 -k2,2 -k3,3 /tmp/table
px12xc
px9xc
pqx12xc
pqx9xc

The problem with this solution (aside from the tedium of specifying as many -k?,? options as there are fields) is that it doesn't generalize to tables with a different number of fields.

Is there a convenient way (either with sort or otherwise) to apply the field-based lexicographic ordering to "all the fields"?

^{¹More generally, if the table has N fields, to decide which of two rows comes first in the lexicographic order, one applies the following recursive rule, letting k range from 1 to N: if the two rows are identical in their values for fields 1 through k – 1, then break the tie according to the value in the k-th field.}

Best Answer

We can cheat and replace the separator character with another (eg the NUL character) that would allow "native" sorting to work, then set it back.

tr x '\000' < file.txt | sort | tr '\000' x

Now we may need to be careful of locale impact on ordering, so you may need to do

tr x '\000' < file.txt | LANG=C sort | tr '\000' x

This only works as long as the data has no NULs in it!

How this works may be easier explained with a formatted table:

We need to compare two rows:

ab  | def
abc | def

We set the separator to be NUL and we compare ab<NUL>def to abc<NUL>def. The <NUL> comes before c and so we've correctly sorted on the first field.

Now let's say the first field matches and the second field is different

abc | def
abc | ghi

Now we compare abc<NUL>def to abc<NUL>ghi. We've got a match through the first field and the separator and are now sorting on the second field.

Explanation

use List::Util 'shuffle'; : this imports a Perl module that provides the shuffle() function which randomizes an array.
@l=<>; : load the input file (<>) into the array @l.
for $i (1..10000){} : run this 10000 times.
@r=shuffle(0..$#l); : $#l is the number of elements in @l so @r is now a randomized list of the index numbers of the array @l (the input file's lines).
open(my $fh, ">","file.$i.bed"); : open a file called file.$i.bed for writing. $i will take values from 1 to 10000.
print $fh @l[@r[0..999]] : take the first 1000 indices in the shuffled array and print the corresponding lines (elements of @l).

Another approach is to use shuf (thanks @frostschutz):

$ time for i in {1..10000}; do shuf -n 1000 file.bed > file.$i.abed; done

real    1m9.743s
user    0m23.732s
sys     0m31.764s

Sort – Lexicographically Sort One Column and Numerically Sort Another

Yes, using the -k option to define sort keys, and the n option to specify numerical sorts:

$ echo -e "a 13\nb 2\na 2" | sort -k1,1 -k2,2n
a 2
a 13
b 2

Best Answer

Related Solutions

Sorting data faster approach

Explanation

Sort – Lexicographically Sort One Column and Numerically Sort Another

Related Question