Identifying genes from a list of genes

awk, grep, text-processing

I have a gene list file, something like this:

    SWT21
    SSA1
    NRP1
    EFB1
    TFC3
    MDM10

I have another file which also contains the names of these genes in my list along with other essential information about them. The second file looks like this:

    chrI    147593  151166  YAL001C -   TFC3
    chrI    143706  147531  YAL002W +   VPS8
    chrI    142173  143160  YAL003W +   EFB1
    chrI    140759  141407  YAL004W +   YAL004W
    chrI    139502  141431  YAL005C -   SSA1
    chrI    137697  138345  YAL007C -   ERP2
    chrI    136913  137510  YAL008W +   FUN14
    chrI    135853  136633  YAL009W +   SPO7
    chrI    134183  135665  YAL010C -   MDM10

I want to extract the lines of the second file whose gene names are present in the first file.

Best Answer

All you need is a simple grep:

    grep -Fwf gene_list.txt gene_info.txt

The options used are:

  • -w : Search for whole words only. This ensures that the gene name ERK1 will not match the gene ERK12 (-w is not a standard option, but it is fairly common).
  • -f : Read the patterns to search for from a file, in this case gene_list.txt.
  • -F : Treat the patterns as strings, not as regular expressions. This ensures that a gene name like TOR* (if such a thing existed) would not match TORRRRRR.
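As a quick check, the sketch below recreates the two sample files from the question in a scratch directory and runs the same command. The columns of gene_info.txt are assumed to be tab-separated, and the `workdir` and `matches` variable names are only illustrative.

```shell
# Work in a throwaway directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The gene list from the question, one name per line.
printf '%s\n' SWT21 SSA1 NRP1 EFB1 TFC3 MDM10 > gene_list.txt

# The annotation file from the question (tab-separated is an assumption).
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chrI 147593 151166 YAL001C - TFC3 \
  chrI 143706 147531 YAL002W + VPS8 \
  chrI 142173 143160 YAL003W + EFB1 \
  chrI 140759 141407 YAL004W + YAL004W \
  chrI 139502 141431 YAL005C - SSA1 \
  chrI 137697 138345 YAL007C - ERP2 \
  chrI 136913 137510 YAL008W + FUN14 \
  chrI 135853 136633 YAL009W + SPO7 \
  chrI 134183 135665 YAL010C - MDM10 > gene_info.txt

# Fixed-string, whole-word search using every line of gene_list.txt as a pattern.
matches=$(grep -Fwf gene_list.txt gene_info.txt)
printf '%s\n' "$matches"
```

With this sample data, the lines for TFC3, EFB1, SSA1 and MDM10 should be printed, in the order they appear in gene_info.txt. SWT21 and NRP1 are in the list but have no entry in this excerpt of the second file, so they produce no output.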

NOTE: This assumes that there are no spaces around the gene names in your list. If there are, you will need to remove them first (here with GNU sed):

    sed -i 's/ //g' gene_list.txt
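If GNU sed is not available, or you would rather not edit the list in place, a rough alternative is to strip the whitespace with tr and write the result to a new file. This is only a sketch: the output name gene_list.clean.txt is made up for illustration, and the input here is a small demo list with stray spaces and tabs.

```shell
# Work in a throwaway directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Demo list with spaces and tabs around the gene names.
printf ' SSA1 \nEFB1\t\n  TFC3\n' > gene_list.txt

# Delete every space and tab; newlines are kept, so there is still one name per line.
# gene_list.clean.txt is a hypothetical output name; the original file is left untouched.
tr -d ' \t' < gene_list.txt > gene_list.clean.txt

cat gene_list.clean.txt
```

The cleaned file can then be passed to grep in place of the original list, for example `grep -Fwf gene_list.clean.txt gene_info.txt`.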