Grep Command – Troubleshooting Grep Errors

I am using grep to filter the contents of a file by a set of patterns (gene names, in my case).
For more background, see my earlier question:

Find pattern from one file listed in another

My code should work, but it does not:

 grep -f file1 file2

Here is my subset of genes (file1):

C1QTNF3
C5orf22
C5orf28
C5orf34
C5orf38
C5orf42
C5orf49
C5orf51
C5orf64
C6
C7
C9
CAPSL
CARD6
CARTPT
CCDC125
CCDC152
CCL28
CCNB1
CCNO
CCT5
CD180
CDC20B
CDH10
CDH12
CDH18
CDH6
CDH9
CDK7
CENPH
CENPK
CKMT2
CLPTM1L
CMBL
CMYA5
COL4A3BP
CR749689
CRHBP
CRSP8P
CT49
CTNND2
CWC27
DAB2
DAP
DDX4
DEPDC1B
DHFR
DHX29
DIMT1
DMGDH

Below is the text from file2 that is being matched, even though its gene, UNC79 (see SNPEFF_GENE_NAME=UNC79), is not in file1:

  AC=3;AF=0.016;AN=186;BaseQRankSum=0.075;DB;DP=292;Dels=0.00;FS=4.271;HaplotypeScore=0.0891;InbreedingCoeff=0.0225;MLEAC=2;MLEAF=0.011;MQ=59.18;MQ0=1;MQRankSum=0.969;QD=13.42;ReadPosRankSum=-0.373;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=23;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=UNC79;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000256339;VQSLOD=9.31;culprit=DP

Hence, the output of grep is the whole block of text from file2.

Below is the complete row from the file that causes the issue. The second column is the gene name.
This gene is not in file1, so I don't want this row in the output. I have 1000 such rows for different genes, and I want to keep only the rows whose gene is in file1.

    intronic    UNC79   14  94062922    94062922    A   G   het 80.54   3   14  94062922    rs183710732 A   G   80.54   PASS    AC=3;AF=0.016;AN=186;BaseQRankSum=0.075;DB;DP=292;Dels=0.00;FS=4.271;HaplotypeScore=0.0891;InbreedingCoeff=0.0225;MLEAC=2;MLEAF=0.011;MQ=59.18;MQ0=1;MQRankSum=0.969;QD=13.42;ReadPosRankSum=-0.373;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=23;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=UNC79;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000256339;VQSLOD=9.31;culprit=DP    GT:AD:DP:GQ:PL  0/1:1,2:3:33:39,0,33

Best Answer

Since your gene names are always in the 2nd column of the file, you can use awk for this:

awk '
    {   ## while reading the first file, save name in the array a
        if(NR==FNR){a[$1]++;} 

        ## If this is the 2nd file
        else{
            ## print if the value of the second column is defined in the array 
            if($2 in a){print}
        }
    }' file1 file2

The same, condensed:

awk '{if(NR==FNR){a[$1]++;}else{if($2 in a){print}}}' file1 file2 

More condensed:

awk '(NR==FNR){a[$1]++}($2 in a){print}' file1 file2 

And truly minimalist (in answer to @Awk):

awk 'NR==FNR{a[$1]}$2 in a' file1 file2
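
As a side note on why the original grep -f file1 file2 matched: grep treats every line of file1 as a pattern that may match anywhere in a line, and C7 (which is in file1) occurs inside UNC79. Below is a minimal sketch that reproduces this on two made-up files. The file names genes.txt and variants.tsv and the row contents are invented for the demo and are not taken from your data. It also adds next to the awk one-liner, which the condensed forms above leave out, so that lines of the first file are never tested against the $2 in a pattern.

```shell
#!/bin/sh
# Hypothetical sample data: file names and rows are made up for this demo.
tmpdir=$(mktemp -d)
printf 'C6\nC7\nC9\nCAPSL\n' > "$tmpdir/genes.txt"
printf 'intronic\tUNC79\t14\t100\nexonic\tC7\t5\t200\n' > "$tmpdir/variants.tsv"

# Plain grep -f: each pattern may match anywhere in the line,
# so C7 also matches inside UNC79 and both rows are counted.
grep_plain=$(grep -c -f "$tmpdir/genes.txt" "$tmpdir/variants.tsv")

# -F treats patterns as fixed strings, -w requires whole-word matches.
# UNC79 no longer matches, but any column containing C7 as a word still would.
grep_words=$(grep -c -F -w -f "$tmpdir/genes.txt" "$tmpdir/variants.tsv")

# awk: remember the names from the first file, then count rows of the
# second file whose 2nd column is exactly one of those names.
awk_hits=$(awk 'NR==FNR{a[$1]; next} $2 in a {n++} END{print n+0}' \
    "$tmpdir/genes.txt" "$tmpdir/variants.tsv")

echo "plain grep: $grep_plain, grep -Fw: $grep_words, awk: $awk_hits"
# prints: plain grep: 2, grep -Fw: 1, awk: 1

rm -r "$tmpdir"
```

grep -F -w gets rid of the UNC79 hit, but it would still match a gene name that appears as a whole word in any other column (for example in the SNPEFF_GENE_NAME= annotation), which is why comparing the second column in awk is the safer choice here.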