Grep a file on a specific field

command line, grep, text processing

I have two files, let's say

File1:

Locus_1
Locus_2
Locus_3

File2:

3  3  Locus_1  Locus_40  etc_849    
3  2  Locus_2  Locus_94  *    
2  2  Locus_6  Locus_1  *    
2  3  Locus_3,Locus_4  Locus_50  *    
3  3  Locus_9  Locus_3  etc_667

I want to run grep -F with the first file as the list of patterns, but only against the third column of the second file (in the original File2, fields are separated by tabs), so that the output is:

Output:

3  3  Locus_1  Locus_40  etc_849    
3  2  Locus_2  Locus_94  *    
2  3  Locus_3,Locus_4  Locus_50  *

How can I do it?

Edit
To Chaos: no, the comma is not a mistake. A column can contain more than one Locus_*, and if the second Locus_* (the one after the comma) matches one of the lines of File1, I want that line to be retrieved too!

Best Answer

If grep is not strictly required, one simple solution is to use join:

$ join -1 1 -2 3 <(sort file1) <(sort -k3,3 file2)
Locus_1 3 3 Locus_40 etc_849
Locus_2 3 2 Locus_94 *

Explanation:

  • join -1 1 -2 3: join the two files on the first (and only) field of the first file and the third field of the second file. A line is printed when the two fields are equal, with the join field moved to the front of the output.
  • <(sort file1): join needs sorted input.
  • <(sort -k3,3 file2): the input must be sorted on the join field (the third field here); -k3,3 restricts the sort key to that field alone.

Note that join compares the whole field, so the line whose third field is Locus_3,Locus_4 is not matched by Locus_3 and does not appear in the output above.
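
For the case described in the question's edit, where the third column may hold several comma-separated IDs, an awk-based approach may be closer to the requested output. The following is only a sketch under stated assumptions: the file names file1 and file2, the temporary directory, and the names want, ids and result are all hypothetical, and File2 is assumed to be tab-separated as the question says. It reads the IDs from the first file, then splits column 3 of each line of the second file on commas and prints the line if any piece is a known ID.

```shell
# Hypothetical sample data mirroring File1 and File2 from the question.
tmp=$(mktemp -d)
printf '%s\n' Locus_1 Locus_2 Locus_3 > "$tmp/file1"
# printf reuses the format for every group of five arguments (one line each).
printf '%s\t%s\t%s\t%s\t%s\n' \
  3 3 Locus_1 Locus_40 etc_849 \
  3 2 Locus_2 Locus_94 '*' \
  2 2 Locus_6 Locus_1 '*' \
  2 3 Locus_3,Locus_4 Locus_50 '*' \
  3 3 Locus_9 Locus_3 etc_667 > "$tmp/file2"

# While reading the first file (NR == FNR): remember every ID.
# While reading the second file: split column 3 on commas and
# print the line once if any of the pieces is a remembered ID.
result=$(awk -F'\t' '
  NR == FNR { want[$1]; next }
  {
    n = split($3, ids, ",")
    for (i = 1; i <= n; i++)
      if (ids[i] in want) { print; break }
  }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Unlike join, this keeps the original column order and needs no sorting. Because each piece is compared as a whole string, Locus_1 does not match Locus_10, which a plain grep -F on the full line would. One caveat of the NR == FNR idiom is that it assumes the first file is not empty.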