Extracting a subset from a FASTA file

awk, bioinformatics, text processing

I have a FASTA file that looks like this:

>chr1
ACGGTGTAGTCG
>chr2
ACGTGTATAGCT
>chrUn
ACGTGGATATTT
>chr21
ACGTTGATGAAA
>chrX
GTACGGGGGTGG
>chrUn5
TGATAGCTGTTG

I want to extract only chr1, chr2, chr21, and chrX, along with their sequences. The output I want is:

>chr1
ACGGTGTAGTCG
>chr2
ACGTGTATAGCT
>chr21
ACGTTGATGAAA
>chrX
GTACGGGGGTGG

How can I do this on the Unix command line?

Best Answer

With sed:

sed -n '/^>chr1$\|^>chr2$\|^>chr21$\|^>chrX$/{p;n;p}' file
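  • -n suppresses automatic printing of the pattern space.
  • /.../ is the regular expression; it matches a line that is exactly >chr1, >chr2, >chr21 or >chrX. The \| alternation is a GNU sed extension to basic regular expressions.
  • {p;n;p} runs when the expression matches: print the header line, read the next input line into the pattern space, and print that line too. This assumes each sequence occupies a single line, as in the example.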
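A variant using extended regular expressions (-E, accepted by both GNU and BSD sed) avoids the backslashes and groups the alternatives. This is only a sketch: sample.fa and subset.fa are hypothetical file names, and the sample data is copied from the question.

```shell
# Recreate the sample input from the question (sample.fa is a made-up name).
cat > sample.fa <<'EOF'
>chr1
ACGGTGTAGTCG
>chr2
ACGTGTATAGCT
>chrUn
ACGTGGATATTT
>chr21
ACGTTGATGAAA
>chrX
GTACGGGGGTGG
>chrUn5
TGATAGCTGTTG
EOF

# -E switches to extended regular expressions, so ( ) and | need no backslashes.
# The trailing ; before } keeps BSD sed happy as well.
sed -n -E '/^>(chr1|chr2|chr21|chrX)$/{p;n;p;}' sample.fa > subset.fa
```

The ^ and $ anchors matter here: without them, chr2 would also match chr21, and chr1 would match any longer name that starts with chr1.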

If it must be awk, it's nearly the same mechanism:

awk '/^>chr1$|^>chr2$|^>chr21$|^>chrX$/{print;getline;print;}' file
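Both commands above print exactly one line after each matching header, so they rely on every sequence being on a single line. If the sequences in a file are wrapped over several lines, a flag-based awk approach may be more robust. The following is a minimal sketch, not part of the original answer; extract_records is a hypothetical helper name, and it assumes the record ID is the first whitespace-delimited word of the header line.

```shell
# extract_records FILE ID...
# Hypothetical helper: prints every record whose header ID is in the list.
extract_records() {
    file=$1
    shift
    # Hand the wanted IDs to awk as one space-separated string.
    awk -v ids="$*" '
        BEGIN {
            # Build a lookup table keyed by ">ID".
            n = split(ids, a, " ")
            for (i = 1; i <= n; i++) want[">" a[i]] = 1
        }
        # On each header line, decide whether this record is wanted.
        /^>/ { keep = ($1 in want) }
        # Print the header and all following sequence lines while keep is set.
        keep
    ' "$file"
}
```

With the file from the question, it could be called as `extract_records file chr1 chr2 chr21 chrX`. Because IDs are compared as whole strings rather than by regular expression, chr2 does not accidentally select chr21.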