Find Pattern from One File Listed in Another Using Grep

grep

I want to take the patterns listed in one file and search for them in another file. In the second file, those patterns may be separated by commas.

For example, the first file, F1, has genes:

ENSG00000187546
ENSG00000113492  
ENSG00000166971

and the second file, F2, has those genes along with some more columns (five columns) which I need:

region          gene                                                     chromosome  start      end
intronic        ENSG00000135870                                          1           173921301  173921301
intergenic      ENSG00000166971(dist=56181),ENSG00000103494(dist=37091)  16          53594504   53594504
ncRNA_intronic  ENSG00000215231                                          5           5039185    5039185
intronic        ENSG00000157890                                          15          66353740   66353740

So the gene ENSG00000166971, which is present in the second file, does not show up in the grep output because it has another gene with it, separated by a comma.

My code is:

grep -f "F1.txt" "F2.txt" >output.txt

I want those rows even if only one of the genes is present, along with the associated data. Is there any way to do this?

Best Answer

What version of grep are you using? I tried your code and got the following results:

$ grep -f file1 file2
ENSG00000187546
ENSG00000113492
ENSG00000166971,ENSG00000186106

If you just want the results that match you can use grep's -o switch to report only the things that match:

$ grep -o -f file1 file2 
ENSG00000187546
ENSG00000113492
ENSG00000166971

grep version

$ grep --version
grep (GNU grep) 2.14
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

Stray characters in F1.txt?

While debugging this further I noticed two stray spaces at the end of the 2nd line in the file F1.txt. You can see them using hexdump.

$ hexdump -C ff1
00000000  45 4e 53 47 30 30 30 30  30 31 38 37 35 34 36 0a  |ENSG00000187546.|
00000010  45 4e 53 47 30 30 30 30  30 31 31 33 34 39 32 20  |ENSG00000113492 |
00000020  20 0a 45 4e 53 47 30 30  30 30 30 31 36 36 39 37  | .ENSG0000016697|
00000030  31 0a                                             |1.|
00000032

They show up as ASCII code 20 (hexadecimal). You can see them here: 32 20 20 0a.
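If stray whitespace like this turns out to be the culprit, one option is to clean the pattern list before handing it to grep. The following is only a sketch under some assumptions: the file names and temporary directory are illustrative, the ENSG00000113492 row in F2.txt is made up for the demonstration, and -F is added so the patterns are treated as fixed strings.

```shell
# Recreate small sample files in a temporary directory (names are illustrative).
tmpdir=$(mktemp -d)
printf 'ENSG00000187546\nENSG00000113492  \nENSG00000166971\n' > "$tmpdir/F1.txt"
printf '%s\n' \
  'intronic ENSG00000113492 5 100 100' \
  'intergenic ENSG00000166971(dist=56181),ENSG00000103494(dist=37091) 16 53594504 53594504' \
  > "$tmpdir/F2.txt"

# Count matching rows with the raw list; the pattern with two trailing
# spaces cannot match because F2.txt has a single space after that ID.
before=$(grep -c -F -f "$tmpdir/F1.txt" "$tmpdir/F2.txt" || true)

# Strip trailing whitespace from every pattern, then search again.
sed 's/[[:space:]]*$//' "$tmpdir/F1.txt" > "$tmpdir/F1.clean.txt"
after=$(grep -c -F -f "$tmpdir/F1.clean.txt" "$tmpdir/F2.txt" || true)

echo "matching rows before cleaning: $before, after cleaning: $after"
rm -rf "$tmpdir"
```

With this sample data the raw list matches one row and the cleaned list matches both.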

Related Solutions

Grep Command – Troubleshooting Grep Errors

Since your gene names are always in the 2nd column of the file, you can use awk for this:

awk '
    {   ## while reading the first file, save name in the array a
        if(NR==FNR){a[$1]++;} 

        ## If this is the 2nd file
        else{
            ## print if the value of the second column is defined in the array 
            if($2 in a){print}
        }
    }' file1 file2

The same, condensed:

awk '{if(NR==FNR){a[$1]++;}else{if($2 in a){print}}}' file1 file2

More condensed:

awk '(NR==FNR){a[$1]++}($2 in a){print}' file1 file2

And truly minimalist (in answer to @Awk):

awk 'NR==FNR{a[$1]}$2 in a' file1 file2
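Note that these awk variants compare the whole second column against the list, so a row whose gene column holds several comma-separated IDs with annotations (like the intergenic row in the original question) would not be printed. Here is a rough sketch of one way to handle that, assuming whitespace-separated columns and that any annotation starts with an opening parenthesis, as in "(dist=56181)". The sample files and the variable names are only for illustration.

```shell
# Build small sample inputs in a temporary directory (illustrative names).
tmpdir=$(mktemp -d)
printf 'ENSG00000187546\nENSG00000113492\nENSG00000166971\n' > "$tmpdir/file1"
printf '%s\n' \
  'intronic ENSG00000135870 1 173921301 173921301' \
  'intergenic ENSG00000166971(dist=56181),ENSG00000103494(dist=37091) 16 53594504 53594504' \
  > "$tmpdir/file2"

result=$(awk '
    NR == FNR { a[$1]; next }            # first file: remember each gene ID
    {
        n = split($2, genes, ",")        # second file: split the gene column on commas
        for (i = 1; i <= n; i++) {
            id = genes[i]
            sub(/\(.*$/, "", id)         # drop a trailing "(dist=...)" annotation
            if (id in a) { print; break }  # print the row once, on the first hit
        }
    }' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

On this sample only the intergenic row is printed, because ENSG00000166971 is in the list and ENSG00000135870 is not.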

Identifying genes from a list of genes

All you need is a simple grep:

grep -Fwf gene_list.txt gene_info.txt

The options used are:

-w : Search for whole words. This ensures that the gene name ERK1 will not match the gene ERK12 (-w is not a standard option but is fairly common).
-f : Read the patterns to be searched for from a file. In this case gene_list.txt.
-F : Treat the patterns as strings, not as regular expressions. This ensures that a gene name like TOR* (if such a thing existed) would not match TORRRRRR.

NOTE: This assumes that there are no spaces around the gene names in your list. If there are, you will need to remove them first (here with GNU sed):

sed -i 's/ //g' gene_list.txt
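To see how these options behave together, here is a small self-contained sketch. The sample rows are invented for the demonstration (ERK1 and ERK12 come from the option description above, the intergenic row from the original question), and it assumes a grep that supports -w, such as GNU grep.

```shell
# Build a tiny gene list and info table in a temporary directory.
tmpdir=$(mktemp -d)
printf 'ERK1\nENSG00000166971\n' > "$tmpdir/gene_list.txt"
printf '%s\n' \
  'intronic ERK12 1 100 100' \
  'intronic ERK1 2 200 200' \
  'intergenic ENSG00000166971(dist=56181),ENSG00000103494(dist=37091) 16 53594504 53594504' \
  > "$tmpdir/gene_info.txt"

# -F: fixed strings, -w: whole words only, -f: read patterns from the list file.
matches=$(grep -Fwf "$tmpdir/gene_list.txt" "$tmpdir/gene_info.txt")

printf '%s\n' "$matches"
rm -rf "$tmpdir"
```

The ERK12 row is skipped because ERK1 is not a whole word there, while the comma-separated row still matches: the parenthesis after the ID is not a word character, so -w treats the ID as a complete word.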