Understanding an awk formula that unwraps fasta files

awkbioinformatics

I have just found a formula which can be used to unwrap fasta files. Before I give the formula, I need to explain what unwrapping a fasta file is.
In short, the fasta format is like this:

>name_of_sequence$
xxxxxxxxxxxxxxxxxxxxxx$
>name_of_sequence_2$
xxxxxxxxxxxxxxxxxxxxxx$
>name_of_sequence_3$
xxxxxxxxxxxxxxxxxxxxxx$

This would be a normal fasta file as I only have one line per sequence (xxxxxx…). The dollar sign are line breaks.

Sometimes however, you can find wrapped fasta files like this:

>name_of_sequence$
xxxxxxxxx$
xxxxxxxxx$
xxxx$
>name_of_sequence_2$
xxxxxxxxx$
xxxxxxxxx$
xxxx$
>name_of_sequence_3$
xxxxxxxxx$
xxxxxxxxx$
xxxx$

Here, you still only have three sequences but each of them are broken into three parts.
Unwrapping a fasta file means to convert the latter format to the former (one line per sequence).

To do this, you need to remove line breaks from the latter file but not all of them. You would need to keep the line break after the name of the sequence (e.g. here: >name_of_sequence$) and that at the end of the sequence (e.g. here: xxxx$).

It appears that this formula does this:

cat infasta | awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' > outfasta

My question is: Could someone explain to me how it works?

Best Answer

This is your awk script:

/^>/ {
    print s ? s "\n" $0 : $0;
    s = "";
    next;
}

{
    s = s sprintf("%s", $0);
}

END {
    if (s)
      print s;
}

The first block gets triggered only for lines that start with >, i.e. fasta header lines.

In the first block, something gets printed. That something is s ? s "\n" $0 : $0. This means "if s is non-zero (or unset), use s and add a newline to it followed by the whole of the current line, otherwise just use the whole current line". In this program, s will be a partially read sequence belonging to the most recently processed header line, and when the program hits a header line, this print statement will output the last sequence (which is now complete), if there was any, followed by the newly found header line on a new line.

The block then sets s to an empty string (we haven't read any sequence belonging to this header yet), and we skip to the next input line.

The next block is executed for all lines of input (but not for header lines as these will be skipped due to the next in the previous block). It simply appends the current line to s. sprintf is used, but I'm not quite sure why (s = s $0 would probably work too).

The last block will be executed after having read all lines of input. It will print the sequence belonging to the last header line, if there was any.

Summary:

The awk script concatenates all separate sequence lines by saving them in a variable. When a header line is found, it output the sequence read so far together with the new header on a line of its own. At the end, the sequence belonging to the last header is outputted.


Alternative awk script that doesn't store sequence in a variable (may be useful if you have very large genomes in your fasta files):

/^>/ {
    if (NR == 1) {
        print;  # 1st header line, just print it.
    } else {
        # Print a newline for the prev. sequence, then the header line on its own line.
        printf("\n%s\n", $0);
    }
    next; # Skip to next input line.
}

{
    printf("%s", $0); # Print sequence without newline.
}

END {
    printf("\n"); # Add final newline to output.
}

As a "one-liner":

awk '/^>/{if(NR==1){print}else{printf("\n%s\n",$0)}next} {printf("%s",$0)} END{printf("\n")}' sequence.fasta