Sort not working for similar entries

sort

I am sorting a file prior to joining it with another file, using:

sort -k1 file1 > file1_sort

When I try to join with the second file, I get an error saying file1 is not sorted. I think this is occurring because of the following entry:

chr6_32609371_I I2 D 
chr6_32609371 T C

The "chr6_32609371" line needs to be placed before the "chr6_32609371_I" in my sorted file. Is there an argument I can add to the sort command to get this to happen?

Best Answer

The problem is that sort -k1 does not sort on the first field alone; its key runs from the first field to the end of the line. From man sort:

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.

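So -k1 compares "chr6_32609371_I I2 D" with "chr6_32609371 T C" as whole lines. In many locales the collation rules give underscores and spaces little or no weight, so the comparison comes down to I versus T, and since I sorts before T, you get the order you see. To get around this, tell sort to use only the first field by passing both a start and a stop position: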

sort -k1,1 file1 > file1_sort
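
As a rough end-to-end check, here is a minimal sketch that recreates the two lines from the question and joins them against a made-up second file. The contents of file2 are hypothetical, and setting LC_ALL=C is my own assumption rather than something the question requires: it forces plain byte-order comparison so that sort and join agree on the ordering regardless of your locale.

```shell
# Recreate the two problem lines from the question.
printf '%s\n' 'chr6_32609371_I I2 D' 'chr6_32609371 T C' > file1

# Hypothetical second file with the same keys (not from the question).
printf '%s\n' 'chr6_32609371_I y' 'chr6_32609371 x' > file2

# Limit the sort key to field 1 only (start and stop both at field 1).
LC_ALL=C sort -k1,1 file1 > file1_sort
LC_ALL=C sort -k1,1 file2 > file2_sort

# Join on the first field, using the same locale as the sort.
LC_ALL=C join file1_sort file2_sort > joined
cat joined
# chr6_32609371 T C x
# chr6_32609371_I I2 D y
```

If you prefer to stay in your normal locale, the important part is probably just that sort and join run under the same one; sort -c -k1,1 file1_sort will tell you whether a file is ordered the way sort itself expects.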