Join Sorted Files Error – How to Fix

join; linux; sort

I would like to merge a variable (column) from one file into another on Linux.
The first column of each file contains the name I want to merge the files on.

I have sorted both files using both -f and -k:
sort -f -k 1,1 SCZ.N.tmp > SCZ.N.tmp.sorted
and
sort -f -k 1,1 1kg.tmp > 1kG.ref_file.sorted

However, when I join both files with this command:
join -1 1 -2 1 SCZ.N.tmp.sorted 1kG.ref_file.sorted > SCZ.freq.joined

I keep getting the error:
'join: SCZ.N.tmp.sorted:112855: is not sorted: chr1_100002155_D D I6 0.995112 0.0184 0.7897 87016'
Nevertheless, the join continues and the majority of lines are merged. I am not sure whether I am losing a small proportion of cases because of mismatches between the files, or because something goes wrong with sorting these files.

Does anybody know what I am doing wrong, and what I can do to avoid this error?
Thank you!

I have also tried:
LANG=en_EN sort -f -k 1,1 SCZ.N.tmp > SCZ.N.tmp.sorted2
and
LANG=en_EN sort -f -k 1,1 1kg.tmp > 1kg.tmp.sorted2
and then joining with:
LANG=en_EN join -1 1 -2 1 SCZ.N.tmp.sorted2 1kg.tmp.sorted2 > SCZ.freq.joined
But that did not solve it.

Best Answer

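You are sorting the files with the -f option, which folds case so that keys are compared case-insensitively.

However, join by default expects the keys in ordinary case-sensitive order, so a file sorted with -f can look unsorted to it wherever keys differ in case.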

You should add the -i option to the command-line for join, to have it ignore case differences.

Alternatively, omit the -f option from both sorts.
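
Both options can be sketched on toy data as follows. This is only an illustration under assumptions: it relies on GNU coreutils (join -i is a GNU extension), the file names and contents are made-up stand-ins for SCZ.N.tmp and 1kg.tmp, and it forces the C locale with LC_ALL so that sort and join agree on byte order. LC_ALL takes precedence over LANG, so setting only LANG may have no effect if LC_ALL is already set in your environment.

```shell
# Hypothetical miniature stand-ins for SCZ.N.tmp and 1kg.tmp (not the real data)
tmpdir=$(mktemp -d)
printf 'chr1_200_D D I 0.5\nchr1_100_D D I 0.9\nChr1_150_A A G 0.1\n' > "$tmpdir/scz.tmp"
printf 'chr1_200_D 0.77\nChr1_150_A 0.12\nchr1_100_D 0.31\n' > "$tmpdir/ref.tmp"

# Option A: case-insensitive on both sides (sort -f paired with join -i)
LC_ALL=C sort -f -k 1,1 "$tmpdir/scz.tmp" > "$tmpdir/scz.folded"
LC_ALL=C sort -f -k 1,1 "$tmpdir/ref.tmp" > "$tmpdir/ref.folded"
LC_ALL=C join -i -1 1 -2 1 "$tmpdir/scz.folded" "$tmpdir/ref.folded" \
    > "$tmpdir/joined.folded" 2> "$tmpdir/err.folded"

# Option B: plain byte order on both sides (no -f, no -i)
LC_ALL=C sort -k 1,1 "$tmpdir/scz.tmp" > "$tmpdir/scz.bytes"
LC_ALL=C sort -k 1,1 "$tmpdir/ref.tmp" > "$tmpdir/ref.bytes"
LC_ALL=C join -1 1 -2 1 "$tmpdir/scz.bytes" "$tmpdir/ref.bytes" \
    > "$tmpdir/joined.bytes" 2> "$tmpdir/err.bytes"

# Any "is not sorted" complaint from join would land in the err.* files
cat "$tmpdir/err.folded" "$tmpdir/err.bytes"
```

Whichever option you pick, the point is that the sort and the join use the same comparison rules; mixing -f on the sort with a plain join is what triggers the warning.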

Edit: also found another possibility here. The field separators need to be identical for the sort and the join. It looks like both sort and join split fields on whitespace by default, but if you pass -t to one of them you need to pass the same separator to the other, so this may be the next hurdle.
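
If your files turn out to be tab-separated, a rough sketch of keeping the separator consistent could look like this. The file names and contents are invented for illustration, and the tab delimiter is an assumption about your data. The last command uses join -v 1 to list lines of the first file that found no partner, which is one way to see whether rows are being dropped because of genuine mismatches rather than sorting.

```shell
# Hypothetical tab-separated inputs; the second column contains a space on purpose
tmpdir=$(mktemp -d)
tab=$(printf '\t')
printf 'rs2\tallele B\nrs1\tallele A\nrs3\tallele C\n' > "$tmpdir/left.tsv"
printf 'rs1\t0.10\nrs2\t0.20\n' > "$tmpdir/right.tsv"

# Same separator (-t) and same locale for sort and join
LC_ALL=C sort -t "$tab" -k 1,1 "$tmpdir/left.tsv" > "$tmpdir/left.sorted"
LC_ALL=C sort -t "$tab" -k 1,1 "$tmpdir/right.tsv" > "$tmpdir/right.sorted"
LC_ALL=C join -t "$tab" -1 1 -2 1 "$tmpdir/left.sorted" "$tmpdir/right.sorted" \
    > "$tmpdir/joined.tsv"

# Lines of the first file whose key has no match in the second file
LC_ALL=C join -t "$tab" -v 1 "$tmpdir/left.sorted" "$tmpdir/right.sorted" \
    > "$tmpdir/unmatched.tsv"
cat "$tmpdir/unmatched.tsv"
```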
