Extracting a subset of lines from a large text file

text processing

I have a large file with 9 columns, tab delimited. This file is about 39MB, with about 250,000 lines. The last column, column 9, has information like this (note that this is all one column – thie spaces in them doesn't denote a new column, but is just the "data" in that column)

TF_binding_site_cage_181208 ZNFN1A2-91741 ;ALIAS ZNFN1A2 ;L3_ID L3_chrX_+_149850517 
TF_binding_site_cage_181208 ZNFN1A2-92447 ;ALIAS ZNFN1A2 ;L3_ID L3_chrX_-_153016326 
TF_binding_site_cage_181208 ZNFN1A2-92446 ;ALIAS ZNFN1A2 ;L3_ID L3_chrX_-_153016326 
TF_binding_site_cage_181208 ZNFN1A2-92445 ;ALIAS ZNFN1A2 ;L3_ID L3_chrX_-_153016326 
TF_binding_site_cage_181208 SNAI1-3-177789 ;ALIAS SNAI1,SNAI2,SNAI3 ;L3_ID L3_chr1_+_52294530 
TF_binding_site_cage_181208 SNAI1-3-178434 ;ALIAS SNAI1,SNAI2,SNAI3 ;L3_ID L3_chr1_-_52294717 
TF_binding_site_cage_181208 SNAI1-3-178161 ;ALIAS SNAI1,SNAI2,SNAI3 ;L3_ID L3_chr1_-_52604408 
TF_binding_site_cage_181208 SNAI1-3-177489 ;ALIAS SNAI1,SNAI2,SNAI3 ;L3_ID L3_chr1_-_52936367 
TF_binding_site_cage_181208 MEF2A,C,D-173519 ;ALIAS MEF2A,MEF2C,MEF2D ;L3_ID L3_chr8_+_144711658 
TF_binding_site_cage_181208 MEF2A,C,D-173496 ;ALIAS MEF2A,MEF2C,MEF2D ;L3_ID L3_chr8_-_145085726 
TF_binding_site_cage_181208 MEF2A,C,D-172831 ;ALIAS MEF2A,MEF2C,MEF2D ;L3_ID L3_chr8_+_145136211 
TF_binding_site_cage_181208 MEF2A,C,D-173254 ;ALIAS MEF2A,MEF2C,MEF2D ;L3_ID L3_chr9_+_696759 

Basically I am looking for the lines that only contain "MEF2*" so in the example above, it would only select the last 4 lines. I also want the entire rows, not just this column.

I've tried awk-ing this, importing to Excel, importing to R but sometimes my method works BUT I am afraid I have no way to "check" if I got all my lines. (The lines with MEF2 in them span a couple of thousand lines so its hard to manually count).

Can someone think of an algorithm that will help me extract these lines with no (very little) margin of error? I know it seems like a basic thing but I'm afraid that my regex skills are not strong enough to extract all the lines.

Best Answer

This will give you all lines whose 9th column matches MEF2:

awk -F"\t" '$9~/MEF2/' file > output

Assuming your file is always tab-delimited, this will work and you can rest safely. That's as close to 0 margin of error as you'll ever get.

If, however, you have tried importing into something like R (presumably using read.table("file",sep="\t")) and that didn't work, you might have some lines with different numbers of fields (see the end for how to check that). If so, assuming you are always interested in the last field, you can use $(NF) in awk to print the last field, no matter how many fields there are:

awk -F"\t" '$(NF)~/MEF2/' file > output

If you still feel the need to check, you can simply extract all lines that match MEF2, irrespective of where the match is, then compare the results:

grep MEF2 file > output2

Once you have that, you can use wc to check if they have the same number of lines. If they don't, find where they differ by running

grep -vFf output output2

That command will print any lines in output2 that are not present in output1. If there are any, most probably they will have MEF2 somewhere in the line but not in the 9th field. If it's in the 9th field, then you know that your file is not tab separated and there's something wrong with your data.


The awk above is probably the simplest solution but here are a few others that do the same thing:

  • Perl

    perl -F"\t" -lane '$F[8]=~/MEF2/ && print' file
    
  • sed (this one might match wrong lines if you have more than 9 fields)

    sed -n '/\t.*\t.*\t.*\t.*\t.*\t.*\t.*\t.*MEF2.*/p' file
    
  • grep

    grep -P '^.+?\t.*\t.*\t.*\t.*\t.*\t.*\t.*\t.*MEF2.*' file
    

If these don't all produce the same output, you know there's an issue with your file. One more thing you can check is to make sure that all your lines have 9 fields. If they don't, you know there's an issue:

awk -F"\t" 'NF!=9' file

The above will print all lines that don't have exactly 9 tab-separated fields. If there is output, the lines it prints are problematic.

Related Question