I have a large file with 9 columns, tab delimited. This file is about 39MB, with about 250,000 lines. The last column, column 9, has information like this (note that this is all one column – the spaces in it don't denote new columns; they are just the "data" in that column):
TF_binding_site_cage_181208 ZNFN1A2-91741 ;ALIAS ZNFN1A2 ;L3_ID L3_chrX_+_149850517
TF_binding_site_cage_181208 ZNFN1A2-92447 ;ALIAS ZNFN1A2 ;L3_ID L3_chrX_-_153016326
TF_binding_site_cage_181208 ZNFN1A2-92446 ;ALIAS ZNFN1A2 ;L3_ID L3_chrX_-_153016326
TF_binding_site_cage_181208 ZNFN1A2-92445 ;ALIAS ZNFN1A2 ;L3_ID L3_chrX_-_153016326
TF_binding_site_cage_181208 SNAI1-3-177789 ;ALIAS SNAI1,SNAI2,SNAI3 ;L3_ID L3_chr1_+_52294530
TF_binding_site_cage_181208 SNAI1-3-178434 ;ALIAS SNAI1,SNAI2,SNAI3 ;L3_ID L3_chr1_-_52294717
TF_binding_site_cage_181208 SNAI1-3-178161 ;ALIAS SNAI1,SNAI2,SNAI3 ;L3_ID L3_chr1_-_52604408
TF_binding_site_cage_181208 SNAI1-3-177489 ;ALIAS SNAI1,SNAI2,SNAI3 ;L3_ID L3_chr1_-_52936367
TF_binding_site_cage_181208 MEF2A,C,D-173519 ;ALIAS MEF2A,MEF2C,MEF2D ;L3_ID L3_chr8_+_144711658
TF_binding_site_cage_181208 MEF2A,C,D-173496 ;ALIAS MEF2A,MEF2C,MEF2D ;L3_ID L3_chr8_-_145085726
TF_binding_site_cage_181208 MEF2A,C,D-172831 ;ALIAS MEF2A,MEF2C,MEF2D ;L3_ID L3_chr8_+_145136211
TF_binding_site_cage_181208 MEF2A,C,D-173254 ;ALIAS MEF2A,MEF2C,MEF2D ;L3_ID L3_chr9_+_696759
Basically I am looking for the lines that contain "MEF2*", so in the example above it would select only the last 4 lines. I also want the entire rows, not just this column.
I've tried awk-ing this, importing into Excel, and importing into R; sometimes my method works, but I have no way to "check" whether I got all my lines. (The lines with MEF2 in them span a couple of thousand lines, so it's hard to count them manually.)
Can someone think of an approach that will extract these lines with no (or very little) margin of error? I know it seems like a basic thing, but I'm afraid my regex skills are not strong enough to extract all the lines.
Best Answer
This will give you all lines whose 9th column matches `MEF2`. Assuming your file is always tab-delimited, this will work and you can rest safely. That's as close to a 0 margin of error as you'll ever get.
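For example, an `awk` command along those lines (a sketch; `file` is a placeholder for your actual file name):

```shell
# -F'\t' sets the field separator to a tab;
# print (the default action) emits the whole line when field 9 matches MEF2.
awk -F'\t' '$9 ~ /MEF2/' file
```

Because `~` is a regex match, this catches `MEF2A`, `MEF2C`, and so on, anywhere in the 9th column.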
If, however, you have tried importing into something like R (presumably using `read.table("file", sep="\t")`) and that didn't work, you might have some lines with different numbers of fields (see the end for how to check that). If so, assuming you are always interested in the last field, you can use `$(NF)` in `awk` to print the last field, no matter how many fields there are.

If you still feel the need to check, you can simply extract all lines that match `MEF2`, irrespective of where the match is, then compare the results. Once you have that, you can use `wc` to check whether the two outputs have the same number of lines. If they don't, find where they differ by printing any lines in output2 that are not present in output1. If there are any, most probably they will have `MEF2` somewhere in the line but not in the 9th field. If it is in the 9th field, then you know that your file is not tab-separated and there's something wrong with your data.

The `awk` above is probably the simplest solution, but a few other tools can do the same thing: Perl, `sed` (which might match wrong lines if you have more than 9 fields) and `grep`.
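One way these steps might look, assuming the data file is named `file` (`output1` and `output2` are placeholder names, and these one-liners are reconstructions, not the answer's original snippets):

```shell
# Keep lines whose last field contains MEF2, however many fields a line has:
awk -F'\t' '$(NF) ~ /MEF2/' file > output1

# Extract every line containing MEF2 anywhere, for comparison:
grep MEF2 file > output2

# Compare the line counts of the two outputs:
wc -l output1 output2

# Print any lines in output2 that are not present in output1:
grep -Fxvf output1 output2

# Equivalent 9th-field filters in other tools:
perl -F'\t' -ane 'print if $F[8] =~ /MEF2/' file    # Perl (fields are 0-indexed)
sed -n '/^\([^\t]*\t\)\{8\}.*MEF2/p' file           # sed (GNU; may match wrong lines if >9 fields)
grep -E $'^([^\t]*\t){8}.*MEF2' file                # grep (same caveat as the sed version)
```

The `sed` and `grep` patterns skip eight tab-separated fields and then look for `MEF2` anywhere afterwards, which is why they can match past the 9th field on lines that have more than 9.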
If these don't all produce the same output, you know there's an issue with your file. One more thing you can check is to make sure that all your lines have 9 fields. If they don't, you know there's an issue:
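A sketch of that sanity check (again assuming the file is tab-delimited and named `file`):

```shell
# NF is the number of fields on the current line;
# the default action prints every line whose field count isn't 9.
awk -F'\t' 'NF != 9' file
```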
The above will print all lines that don't have exactly 9 tab-separated fields. If there is any output, those are the problematic lines.