I am using grep to filter the contents of a file by a list of patterns (gene names, in my case).
For more background, see my earlier question: Find pattern from one file listed in another
My code should work, but it does not:
grep -f file1 file2
Here is my subset of genes (file1):
C1QTNF3
C5orf22
C5orf28
C5orf34
C5orf38
C5orf42
C5orf49
C5orf51
C5orf64
C6
C7
C9
CAPSL
CARD6
CARTPT
CCDC125
CCDC152
CCL28
CCNB1
CCNO
CCT5
CD180
CDC20B
CDH10
CDH12
CDH18
CDH6
CDH9
CDK7
CENPH
CENPK
CKMT2
CLPTM1L
CMBL
CMYA5
COL4A3BP
CR749689
CRHBP
CRSP8P
CT49
CTNND2
CWC27
DAB2
DAP
DDX4
DEPDC1B
DHFR
DHX29
DIMT1
DMGDH
Below is text from my file (file2) that gets matched even though the gene UNC79 is not in file1. UNC79 appears in file2 as SNPEFF_GENE_NAME=UNC79.
AC=3;AF=0.016;AN=186;BaseQRankSum=0.075;DB;DP=292;Dels=0.00;FS=4.271;HaplotypeScore=0.0891;InbreedingCoeff=0.0225;MLEAC=2;MLEAF=0.011;MQ=59.18;MQ0=1;MQRankSum=0.969;QD=13.42;ReadPosRankSum=-0.373;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=23;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=UNC79;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000256339;VQSLOD=9.31;culprit=DP
Hence, the output of grep is the whole text blob from file2.
Below is the complete row from file2 that causes the issue. The second column is the gene name.
I don't have this gene in file1, so I don't want this row in the output. I have 1000 such rows for different genes, and I want to keep only the rows whose gene is in file1.
intronic UNC79 14 94062922 94062922 A G het 80.54 3 14 94062922 rs183710732 A G 80.54 PASS AC=3;AF=0.016;AN=186;BaseQRankSum=0.075;DB;DP=292;Dels=0.00;FS=4.271;HaplotypeScore=0.0891;InbreedingCoeff=0.0225;MLEAC=2;MLEAF=0.011;MQ=59.18;MQ0=1;MQRankSum=0.969;QD=13.42;ReadPosRankSum=-0.373;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=23;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=UNC79;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000256339;VQSLOD=9.31;culprit=DP GT:AD:DP:GQ:PL 0/1:1,2:3:33:39,0,33
Best Answer
Since your gene names are always in the 2nd column of the file, you can use awk for this. The same command can also be written in a condensed form, a more condensed form, and a truly minimalist form (in answer to @Awk).
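As a minimal sketch of the idea (not necessarily the exact commands from the original answer): `grep -f` treats every line of file1 as a pattern that may match anywhere in a line, so `C7` matches inside `UNC79`. Comparing only the 2nd column avoids that. The sample rows, positions, and the temporary directory below are made up for illustration. The sketch assumes whitespace-separated columns and one gene name per line in file1.

```shell
#!/bin/sh
# Illustrative sketch: the rows and positions below are invented sample data.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# file1: one gene name per line (a few names taken from the question)
printf '%s\n' C6 C7 C9 CCT5 DAB2 > file1

# file2: shortened rows with the gene name in column 2
cat > file2 <<'EOF'
intronic UNC79 14 94062922 A G SNPEFF_GENE_NAME=UNC79
exonic CCT5 5 10250000 C T SNPEFF_GENE_NAME=CCT5
intronic DAB2 5 39370000 G A SNPEFF_GENE_NAME=DAB2
EOF

# grep -f matches a pattern anywhere in the line: "C7" is a substring of "UNC79"
grep_out=$(grep -f file1 file2)

# awk: while reading file1 (NR == FNR), store each name as an array key;
# while reading file2, print only rows whose 2nd field is one of those keys
awk_out=$(awk 'NR == FNR { genes[$1]; next } $2 in genes' file1 file2)

printf 'grep -f:\n%s\n\nawk:\n%s\n' "$grep_out" "$awk_out"
```

On your real data the awk part alone would be `awk 'NR==FNR{g[$1];next} $2 in g' file1 file2`. Note that the `NR == FNR` test assumes file1 is not empty.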