Identifying genes from a list of genes

awk, grep, text-processing

I have a gene list file, something like this:

    SWT21
    SSA1
    NRP1
    EFB1
    TFC3
    MDM10

I have another file which also contains the names of these genes in my list along with other essential information about them. The second file looks like this:

    chrI    147593  151166  YAL001C -   TFC3
    chrI    143706  147531  YAL002W +   VPS8
    chrI    142173  143160  YAL003W +   EFB1
    chrI    140759  141407  YAL004W +   YAL004W
    chrI    139502  141431  YAL005C -   SSA1
    chrI    137697  138345  YAL007C -   ERP2
    chrI    136913  137510  YAL008W +   FUN14
    chrI    135853  136633  YAL009W +   SPO7
    chrI    134183  135665  YAL010C -   MDM10

I want to extract the lines of the second file whose gene names appear in the first file.

Best Answer

All you need is a simple grep:

grep -Fwf gene_list.txt gene_info.txt

The options used are:

-w : Match whole words only. This ensures that the gene name ERK1 will not match the gene ERK12. (-w is not a standard option, but it is fairly common.)
-f : Read the patterns to search for from a file, in this case gene_list.txt.
-F : Treat the patterns as fixed strings, not as regular expressions. This ensures that a gene name like TOR* (if such a thing existed) would not match TORRRRRR.
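
To see the three options at work, here is a minimal sketch that recreates the sample data from the question in a scratch directory and runs the command. The ERK1/ERK12 and TOR* inputs are the hypothetical names used in the option descriptions above, not real entries from the question's files.

```shell
# Recreate the question's sample files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' SWT21 SSA1 NRP1 EFB1 TFC3 MDM10 > gene_list.txt

cat > gene_info.txt <<'EOF'
chrI    147593  151166  YAL001C -   TFC3
chrI    143706  147531  YAL002W +   VPS8
chrI    142173  143160  YAL003W +   EFB1
chrI    140759  141407  YAL004W +   YAL004W
chrI    139502  141431  YAL005C -   SSA1
chrI    137697  138345  YAL007C -   ERP2
chrI    136913  137510  YAL008W +   FUN14
chrI    135853  136633  YAL009W +   SPO7
chrI    134183  135665  YAL010C -   MDM10
EOF

# The command from the answer: fixed strings, whole words, patterns from a file.
# Prints the TFC3, EFB1, SSA1 and MDM10 lines, in gene_info.txt order.
hits=$(grep -Fwf gene_list.txt gene_info.txt)
printf '%s\n' "$hits"

# Effect of -w: count matching lines without and with whole-word matching.
without_w=$(printf 'ERK1\nERK12\n' | grep -c 'ERK1')
with_w=$(printf 'ERK1\nERK12\n' | grep -cw 'ERK1')

# Effect of -F: the same pattern read as a regex, then as a literal string.
as_regex=$(printf 'TOR*\nTORRRRRR\n' | grep -c 'TOR*')
as_string=$(printf 'TOR*\nTORRRRRR\n' | grep -cF 'TOR*')

echo "without -w: $without_w, with -w: $with_w"
echo "as regex: $as_regex, with -F: $as_string"
```

Without -w or -F, both test lines match in each case; with the option, only the exact name does.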

NOTE: This assumes that there are no spaces around the gene names in your list. If there are, you will need to remove them first (here with GNU sed):

sed -i 's/ //g' gene_list.txt
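
Note that the command above deletes every space in the file and edits it in place. As a hedged alternative, the sketch below strips only trailing whitespace (spaces, tabs, and carriage returns from Windows line endings) and writes the result to a new file. The name gene_list.clean.txt and the three-line sample data are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A deliberately messy list: trailing spaces on line 1, a carriage return on line 2.
printf 'TFC3  \nSSA1\r\nEFB1\n' > gene_list.txt

printf '%s\n' \
    'chrI 147593 151166 YAL001C - TFC3' \
    'chrI 142173 143160 YAL003W + EFB1' \
    'chrI 139502 141431 YAL005C - SSA1' > gene_info.txt

# With -F the trailing characters are part of the pattern, so only EFB1 matches.
before=$(grep -cFwf gene_list.txt gene_info.txt)

# Strip trailing whitespace into a new file, leaving the original untouched.
sed 's/[[:space:]]*$//' gene_list.txt > gene_list.clean.txt

after=$(grep -cFwf gene_list.clean.txt gene_info.txt)

echo "matching lines before cleaning: $before, after: $after"
```

Writing to a separate file keeps the original list available for comparison and avoids relying on `sed -i`, which behaves differently across sed implementations.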

grep version

$ grep --version
grep (GNU grep) 2.14
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

Stray characters in F1.txt?

While debugging this further I noticed several stray spaces at the end of the 2nd line in the file F1.txt. You can see them using hexdump.

$ hexdump -C ff1
00000000  45 4e 53 47 30 30 30 30  30 31 38 37 35 34 36 0a  |ENSG00000187546.|
00000010  45 4e 53 47 30 30 30 30  30 31 31 33 34 39 32 20  |ENSG00000113492 |
00000020  20 0a 45 4e 53 47 30 30  30 30 30 31 36 36 39 37  | .ENSG0000016697|
00000030  31 0a                                             |1.|
00000032

They show up as ASCII code 20 (hexadecimal). You can see them here: 32 20 20 0a.
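
If reading a hex dump is more than you need, grep itself can point at the offending lines. The following is only a sketch: it rebuilds a file with the same three IDs that appear in the dump above (two trailing spaces on the second line) and reports which line numbers end in whitespace.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same content as in the hexdump: line 2 ends in two spaces (bytes 20 20).
printf 'ENSG00000187546\nENSG00000113492  \nENSG00000166971\n' > F1.txt

# -n prefixes each matching line with its line number; keep only the number.
bad_lines=$(grep -n '[[:space:]]$' F1.txt | cut -d: -f1)

echo "lines ending in whitespace: $bad_lines"
```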

Grep Command – Troubleshooting Grep Errors

Since your gene names are always in the 2nd column of the file, you can use awk for this:

awk '
    {   ## while reading the first file, save name in the array a
        if(NR==FNR){a[$1]++;} 

        ## If this is the 2nd file
        else{
            ## print if the value of the second column is defined in the array 
            if($2 in a){print}
        }
    }' file1 file2

The same, condensed:

awk '{if(NR==FNR){a[$1]++;}else{if($2 in a){print}}}' file1 file2

More condensed:

awk '(NR==FNR){a[$1]++}($2 in a){print}' file1 file2

And truly minimalist (in answer to @Awk):

awk 'NR==FNR{a[$1]}$2 in a' file1 file2
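
The awk commands above test the 2nd column. In the gene_info.txt sample shown in the question, the gene name sits in the 6th column instead, so an adaptation for that layout might look like the sketch below. The array name `wanted` and the added `next` are my own choices, not part of the original answer. `next` simply skips the second test while the first file is being read.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' SWT21 SSA1 NRP1 EFB1 TFC3 MDM10 > gene_list.txt

cat > gene_info.txt <<'EOF'
chrI    147593  151166  YAL001C -   TFC3
chrI    143706  147531  YAL002W +   VPS8
chrI    142173  143160  YAL003W +   EFB1
chrI    140759  141407  YAL004W +   YAL004W
chrI    139502  141431  YAL005C -   SSA1
chrI    137697  138345  YAL007C -   ERP2
chrI    136913  137510  YAL008W +   FUN14
chrI    135853  136633  YAL009W +   SPO7
chrI    134183  135665  YAL010C -   MDM10
EOF

# First file: remember each gene name. Second file: print lines whose
# 6th field is one of the remembered names.
hits=$(awk 'NR==FNR { wanted[$1]; next } $6 in wanted' gene_list.txt gene_info.txt)
printf '%s\n' "$hits"
```

Because awk's default field splitting discards surrounding spaces and tabs, stray blanks around the names in the list do not affect this comparison. Unlike `grep -w`, it also only matches the chosen column, not a name that happens to occur elsewhere on the line.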
