sed and awk – How to Extract Line from File on Specific Condition


File:

chromosome  position  ref  alt 
chr1          1398     A    T 
chr1          2980     A    C 
chr2          3323     C    T,A
chr2          3749     T    G
chr3          5251     C    T,G
chr3          9990     G    C,T
chr4          10345    T    G

I need to extract the full line whenever column 4 contains two or more characters separated by commas.

Expected Output is:

chr2          3323     C    T,A
chr3          5251     C    T,G
chr3          9990     G    C,T

Best Answer

A couple of other ways to look at this.

Method #1

Since you are only interested in lines where column 4 holds two or more characters separated by commas, and no other column in this file contains a comma, you could just grep for commas:

$ grep "," sample.txt 
chr2          3323     C    T,A
chr3          5251     C    T,G
chr3          9990     G    C,T

Method #2

You could use grep's PCRE facility (the -P option of GNU grep), which lets grep match with Perl-compatible regular expressions. It's quite powerful and gives you much of Perl's pattern-matching ability from within grep.

loosely defined

$ grep -P "(\w,)+" sample.txt

strictly defined

$ grep -P '\w+\d\s+\d+\s+\w\s+(\w,)+' sample.txt
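Where grep has no -P support, roughly the same strict match can be sketched with POSIX extended regular expressions (grep -E). This is a swapped-in variant rather than part of the original answer. It assumes that alt is the last column and contains only the letters A, C, G and T; the temporary file is a stand-in for sample.txt.

```shell
# Recreate the sample data in a temporary file (stand-in for sample.txt).
sample=$(mktemp)
cat > "$sample" <<'EOF'
chromosome  position  ref  alt
chr1          1398     A    T
chr1          2980     A    C
chr2          3323     C    T,A
chr2          3749     T    G
chr3          5251     C    T,G
chr3          9990     G    C,T
chr4          10345    T    G
EOF

# Match lines whose last field is two or more bases joined by commas;
# blanks after the last field are tolerated.
multi_alt=$(grep -E '[[:space:]][ACGT](,[ACGT])+[[:space:]]*$' "$sample")
printf '%s\n' "$multi_alt"
rm -f "$sample"
```

Anchoring the pattern to the end of the line keeps it from matching a comma that might appear in an earlier column.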

Method #3

Using awk. This again is taking advantage of the fact that only the lines with a comma (,) are of interest, so it just finds them and prints them:

loosely defined

$ awk '/,/{print}' sample.txt

more strictly defined

$ awk '/([[:alpha:]])+,[[:alpha:]]/{print}' sample.txt

even more strictly defined

$ awk '$4 ~ /([[:alpha:]])+,[[:alpha:]]/{print}' sample.txt

This one looks only at the contents of the 4th column and checks that it contains one or more letters followed by a comma and another letter.

most strictly defined

$ awk '$4 ~ /([GATC])+,[GATC]/{print}' sample.txt

This matches only a G, A, T, or C followed by a comma and another G, A, T, or C.
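The requirement can also be checked by counting instead of pattern matching. A minimal sketch, assuming whitespace-separated columns as in the sample: awk's split() returns the number of pieces it produced, so a result of 2 or more means column 4 holds at least two comma-separated entries. The second command is an anchored variant of the last pattern, so the whole field must consist of single bases joined by commas. The variable names and the temporary file are illustrative only.

```shell
# Recreate the sample data in a temporary file (stand-in for sample.txt).
sample=$(mktemp)
cat > "$sample" <<'EOF'
chromosome  position  ref  alt
chr1          1398     A    T
chr1          2980     A    C
chr2          3323     C    T,A
chr2          3749     T    G
chr3          5251     C    T,G
chr3          9990     G    C,T
chr4          10345    T    G
EOF

# split() returns how many comma-separated pieces column 4 contains.
by_count=$(awk 'split($4, parts, ",") >= 2' "$sample")

# Anchored regex: the entire field must be single bases joined by commas.
by_regex=$(awk '$4 ~ /^[GATC](,[GATC])+$/' "$sample")

printf '%s\n' "$by_count"
rm -f "$sample"
```

On the sample data both commands select the same three lines; they would differ only if column 4 held something other than single bases.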

Related Solutions

How to both extract a specific line in a text file as well as multiple lines containing a specific string

Just redirect grep's output in append mode:

grep "string" source.txt >> destination.txt
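The effect of >> is easy to verify. A small sketch, using temporary files as stand-ins for source.txt and destination.txt; the file contents are invented for illustration.

```shell
src=$(mktemp)
dest=$(mktemp)
printf 'a string here\nno match\nanother string\n' > "$src"
printf 'existing line\n' > "$dest"

# >> appends the matching lines after whatever the file already holds;
# a single > would have truncated the file first.
grep "string" "$src" >> "$dest"

appended=$(cat "$dest")
printf '%s\n' "$appended"
rm -f "$src" "$dest"
```

The line that was already in the destination file is still there, followed by the two matching lines.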

Extract Strings from First Column of a File – Text Processing Guide

You can use the following awk program:

awk -F' *[|]' 'NR==FNR{searchstr[$1]=1} NR>FNR && ($1 in searchstr) {print}' string.txt masterFile.list

As you can see, you provide both files as arguments to awk.

While the first file is processed (indicated by FNR, the per-file line-counter, being equal to NR, the global line counter), we simply register all search strings (field nr. 1 of each line, since they are the only items) in an array searchstr (however, in form of an array index, so the "value" is just a dummy value of 1).
When we come to the second file (NR is now greater than FNR), we check if the first column ($1) is contained as an array index in searchstr. If so, we print the entire line.

The idea behind this is that awk has a convenient syntax, "string in array", which is true if string is one of the indices of array.
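The two-file idiom described above can be tried on made-up data. This is only a sketch: the real contents of string.txt and masterFile.list are not shown here, so the IDs and descriptions below are hypothetical. The separator is written as ' *[|]' so that the | is taken literally.

```shell
workdir=$(mktemp -d)
# Hypothetical inputs; the real files are not shown in the question.
printf '1002\n1004\n' > "$workdir/string.txt"
printf '1001 | alpha\n1002 | beta\n1003 | gamma\n1004 | delta\n' > "$workdir/masterFile.list"

matches=$(awk -F' *[|]' '
    NR == FNR { searchstr[$1] = 1; next }   # first file: remember each search string
    $1 in searchstr                          # second file: print lines whose first field was seen
' "$workdir/string.txt" "$workdir/masterFile.list")

printf '%s\n' "$matches"
rm -rf "$workdir"
```

Only the master-file lines whose first field appears in the search list are printed.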

As noted by Ed Morton, you can "golf" this into

awk -F' *[|]' 'NR==FNR{searchstr[$1]; next} $1 in searchstr' string.txt masterFile.list

The bare reference searchstr[$1] creates that array entry without assigning it a value, and the pattern $1 in searchstr, placed outside any action block, tells awk to print the current line whenever it evaluates to true. The next statement in the rule for string.txt ensures that this pattern is only reached for masterFile.list.

Note that I specified a full regular expression ( *[|], i.e. any amount of space followed by a literal |) as the field separator to ensure that the "first field" of masterFile.list really is only the number. The | is wrapped in a bracket expression because a bare | is the alternation operator in awk's regular expressions. Specifying -F'|' would have meant that the trailing space is included too, which would have made the matching more involved. If the "spaces" can also include TABs, use -F'[[:space:]]*[|]' instead.
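The effect of the two separators on the first field can be compared directly. A minimal sketch on a single made-up line; the square brackets in the output only make the trailing spaces visible.

```shell
line='1002   | beta'

# A single-character separator is taken literally, so the padding stays in field 1.
plain=$(printf '%s\n' "$line" | awk -F'|' '{ print "[" $1 "]" }')

# Spaces followed by a literal | (bracketed so it is not read as alternation).
trimmed=$(printf '%s\n' "$line" | awk -F' *[|]' '{ print "[" $1 "]" }')

printf '%s\n%s\n' "$plain" "$trimmed"
```

With the plain separator the first field keeps its trailing spaces, so a lookup against search strings that carry no padding would fail.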