File:
chromosome position ref alt
chr1 1398 A T
chr1 2980 A C
chr2 3323 C T,A
chr2 3749 T G
chr3 5251 C T,G
chr3 9990 G C,T
chr4 10345 T G
I need to extract the full line when column 4 has 2 or more characters separated by a comma.
Expected Output is:
chr2 3323 C T,A
chr3 5251 C T,G
chr3 9990 G C,T
Best Answer
A couple of other ways to look at this.
Method #1
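A minimal sketch of this first approach, which simply keeps every line containing a comma. The temporary file behind `$sample` and the variable `matches` are placeholders introduced here for illustration; they are not part of the original post.

```shell
# Hypothetical sample file holding the data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chromosome position ref alt
chr1 1398 A T
chr1 2980 A C
chr2 3323 C T,A
chr2 3749 T G
chr3 5251 C T,G
chr3 9990 G C,T
chr4 10345 T G
EOF

# Keep every line that contains a comma anywhere on the line.
matches=$(grep ',' "$sample")
printf '%s\n' "$matches"
```

This works here only because commas appear nowhere except in the fourth column of this data.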
Since you are only interested in lines that have 2 or more characters separated by commas, you could just grep for commas.
Method #2
You could use grep's PCRE facility. This is where grep can use Perl's regular expression engine to do the matching. It's quite powerful and lets you do a lot of what you can do with Perl from grep.
loosely defined
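One possible loose PCRE pattern, sketched under the assumption that GNU grep with PCRE support (the `-P` option) is available. The pattern, the temporary file behind `$sample`, and the variable `matches` are illustrative choices, not taken from the original post.

```shell
# Hypothetical sample file holding the data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chromosome position ref alt
chr1 1398 A T
chr1 2980 A C
chr2 3323 C T,A
chr2 3749 T G
chr3 5251 C T,G
chr3 9990 G C,T
chr4 10345 T G
EOF

# -P enables Perl-compatible regular expressions (GNU grep).
# \w,\w matches a word character, a comma, then another word character.
matches=$(grep -P '\w,\w' "$sample")
printf '%s\n' "$matches"
```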
strictly defined
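A stricter pattern might describe the whole line rather than just the comma. The sketch below assumes GNU grep with `-P` support and assumes the layout is always a chromosome name, a numeric position, a single reference base, and a comma-separated list of alternate bases. The pattern and the names `sample` and `matches` are hypothetical.

```shell
# Hypothetical sample file holding the data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chromosome position ref alt
chr1 1398 A T
chr1 2980 A C
chr2 3323 C T,A
chr2 3749 T G
chr3 5251 C T,G
chr3 9990 G C,T
chr4 10345 T G
EOF

# Anchor the pattern to the full line:
#   chr name, position digits, one reference base,
#   then one base followed by at least one ",base".
matches=$(grep -P '^chr\w+\s+\d+\s+[ACGT]\s+[ACGT](,[ACGT])+$' "$sample")
printf '%s\n' "$matches"
```

Because the last group must repeat at least once, lines with a single alternate base are rejected, as is the header line.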
Method #3
Using awk. This again takes advantage of the fact that only the lines with a comma (,) are of interest, so it just finds them and prints them.
loosely defined
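A minimal sketch of the loose awk version: a bare regular expression with no action prints every matching line. The temporary file behind `$sample` and the variable `matches` are placeholders introduced for illustration.

```shell
# Hypothetical sample file holding the data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chromosome position ref alt
chr1 1398 A T
chr1 2980 A C
chr2 3323 C T,A
chr2 3749 T G
chr3 5251 C T,G
chr3 9990 G C,T
chr4 10345 T G
EOF

# A pattern with no action block prints each line that matches it.
matches=$(awk '/,/' "$sample")
printf '%s\n' "$matches"
```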
more strictly defined
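A slightly stricter sketch requires a letter on each side of the comma, still matching anywhere on the line. The explicit `[A-Za-z]` ranges are an assumption made here for portability across awk implementations; the names `sample` and `matches` are hypothetical.

```shell
# Hypothetical sample file holding the data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chromosome position ref alt
chr1 1398 A T
chr1 2980 A C
chr2 3323 C T,A
chr2 3749 T G
chr3 5251 C T,G
chr3 9990 G C,T
chr4 10345 T G
EOF

# Require a letter, a comma, then another letter somewhere on the line.
matches=$(awk '/[A-Za-z],[A-Za-z]/' "$sample")
printf '%s\n' "$matches"
```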
even more strictly defined
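A sketch of a variant that restricts the match to the fourth field, assuming whitespace-separated columns (awk's default field splitting). The pattern and the names `sample` and `matches` are illustrative.

```shell
# Hypothetical sample file holding the data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chromosome position ref alt
chr1 1398 A T
chr1 2980 A C
chr2 3323 C T,A
chr2 3749 T G
chr3 5251 C T,G
chr3 9990 G C,T
chr4 10345 T G
EOF

# $4 is the fourth whitespace-separated field; ~ tests it against the regex.
matches=$(awk '$4 ~ /[A-Za-z],[A-Za-z]/' "$sample")
printf '%s\n' "$matches"
```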
This one looks at the contents of the 4th column and checks that it's a letter followed by a comma, followed by another letter.
even more strictly defined
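A sketch of the tightest variant, which limits the fourth field to nucleotide letters only. As before, the temporary file behind `$sample` and the variable `matches` are placeholders, and the exact pattern is an assumption.

```shell
# Hypothetical sample file holding the data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chromosome position ref alt
chr1 1398 A T
chr1 2980 A C
chr2 3323 C T,A
chr2 3749 T G
chr3 5251 C T,G
chr3 9990 G C,T
chr4 10345 T G
EOF

# Accept only the bases G, A, T, or C on either side of the comma in column 4.
matches=$(awk '$4 ~ /[GATC],[GATC]/' "$sample")
printf '%s\n' "$matches"
```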
This looks for only a G, A, T, or C followed by a comma, followed by another G, A, T, or C.