Extracting subset from fasta file

awkbioinformaticstext processing

I have a fasta file which looks like this:

>chr1
ACGGTGTAGTCG
>chr2
ACGTGTATAGCT
>chrUn
ACGTGGATATTT
>chr21
ACGTTGATGAAA
>chrX
GTACGGGGGTGG
>chrUn5
TGATAGCTGTTG

I just want to extract chr1, chr2, chr21, chrX along with their sequences. So the output I want is:

>chr1
ACGGTGTAGTCG
>chr2
ACGTGTATAGCT
>chr21
ACGTTGATGAAA
>chrX
GTACGGGGGTGG

How can I do it in unix command line?

Best Answer

With sed:

sed -n '/^>chr1$\|^>chr2$\|^>chr21$\|^>chrX$/{p;n;p}' file

-n suppresses automatic output.
/.../ the regular expression to match >chr1, >chr2, >chr21 or >chrX.
{p;n;p} if the expression matches, print the line, read the next input line to pattern space, and print that line too.

If it must be awk, it's nearly the same mechanism:

awk '/^>chr1$|^>chr2$|^>chr21$|^>chrX$/{print;getline;print;}' file

Related Solutions

Understanding an awk formula that unwraps fasta files

This is your awk script:

/^>/ {
    print s ? s "\n" $0 : $0;
    s = "";
    next;
}

{
    s = s sprintf("%s", $0);
}

END {
    if (s)
      print s;
}

The first block gets triggered only for lines that start with >, i.e. fasta header lines.

In the first block, something gets printed. That something is s ? s "\n" $0 : $0. This means "if s is non-zero (or unset), use s and add a newline to it followed by the whole of the current line, otherwise just use the whole current line". In this program, s will be a partially read sequence belonging to the most recently processed header line, and when the program hits a header line, this print statement will output the last sequence (which is now complete), if there was any, followed by the newly found header line on a new line.

The block then sets s to an empty string (we haven't read any sequence belonging to this header yet), and we skip to the next input line.

The next block is executed for all lines of input (but not for header lines as these will be skipped due to the next in the previous block). It simply appends the current line to s. sprintf is used, but I'm not quite sure why (s = s $0 would probably work too).

The last block will be executed after having read all lines of input. It will print the sequence belonging to the last header line, if there was any.

Summary:

The awk script concatenates all separate sequence lines by saving them in a variable. When a header line is found, it output the sequence read so far together with the new header on a line of its own. At the end, the sequence belonging to the last header is outputted.

Alternative awk script that doesn't store sequence in a variable (may be useful if you have very large genomes in your fasta files):

/^>/ {
    if (NR == 1) {
        print;  # 1st header line, just print it.
    } else {
        # Print a newline for the prev. sequence, then the header line on its own line.
        printf("\n%s\n", $0);
    }
    next; # Skip to next input line.
}

{
    printf("%s", $0); # Print sequence without newline.
}

END {
    printf("\n"); # Add final newline to output.
}

As a "one-liner":

awk '/^>/{if(NR==1){print}else{printf("\n%s\n",$0)}next} {printf("%s",$0)} END{printf("\n")}' sequence.fasta

Best Answer

Related Solutions

Understanding an awk formula that unwraps fasta files

Related Question