Understanding an awk formula that unwraps fasta files

awk bioinformatics

I have just found a formula which can be used to unwrap fasta files. Before I give the formula, I need to explain what unwrapping a fasta file is.
In short, the fasta format is like this:

>name_of_sequence$
xxxxxxxxxxxxxxxxxxxxxx$
>name_of_sequence_2$
xxxxxxxxxxxxxxxxxxxxxx$
>name_of_sequence_3$
xxxxxxxxxxxxxxxxxxxxxx$

This would be a normal fasta file, as there is only one line per sequence (xxxxxx…). The dollar signs mark line breaks.

Sometimes however, you can find wrapped fasta files like this:

>name_of_sequence$
xxxxxxxxx$
xxxxxxxxx$
xxxx$
>name_of_sequence_2$
xxxxxxxxx$
xxxxxxxxx$
xxxx$
>name_of_sequence_3$
xxxxxxxxx$
xxxxxxxxx$
xxxx$

Here, you still only have three sequences, but each of them is broken into three parts.
Unwrapping a fasta file means converting the latter format to the former (one line per sequence).

To do this, you need to remove line breaks from the latter file, but not all of them. You need to keep the line break after the name of the sequence (e.g. here: >name_of_sequence$) and the one at the end of the sequence (e.g. here: xxxx$).

It appears that this formula does this:

cat infasta | awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' > outfasta

My question is: Could someone explain to me how it works?

Best Answer

This is your awk script:

/^>/ {
    print s ? s "\n" $0 : $0;
    s = "";
    next;
}

{
    s = s sprintf("%s", $0);
}

END {
    if (s)
      print s;
}

The first block gets triggered only for lines that start with >, i.e. fasta header lines.

In the first block, something gets printed. That something is s ? s "\n" $0 : $0. This means "if s is a non-empty string, use s and add a newline to it followed by the whole of the current line, otherwise (s is empty or not yet set) just use the whole current line". In this program, s will be a partially read sequence belonging to the most recently processed header line, and when the program hits a header line, this print statement will output the last sequence (which is now complete), if there was any, followed by the newly found header line on a new line.
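To see the two branches of that conditional in isolation, here is a minimal sketch that evaluates the same expression inside a BEGIN block. The function name demo_ternary, the variable header (standing in for $0) and the sample values are made up for illustration.

```shell
# Hypothetical demo: "header" stands in for $0, and s is set by hand.
demo_ternary() {
    awk 'BEGIN {
        header = ">seq2"

        s = ""                  # no sequence collected yet
        print (s ? s "\n" header : header)

        s = "ACGT" "TTGA"       # two sequence lines, already joined
        print (s ? s "\n" header : header)
    }'
}
demo_ternary
```

The first print emits only the header line; the second emits the collected sequence, a newline, and then the header.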

The block then sets s to an empty string (we haven't read any sequence belonging to this header yet), and we skip to the next input line.

The next block is executed for all lines of input (but not for header lines, as these are skipped due to the next in the previous block). It simply appends the current line to s. sprintf is used, but it is not needed: sprintf("%s", $0) just returns $0 as a string, so s = s $0 does the same thing.
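A quick way to check that the sprintf call is redundant is to run both variants on the same input and compare. This is only a sketch: the function names and the sample record are invented for the test.

```shell
# Hypothetical sample: one record wrapped over three lines.
sample() { printf '>seq1\nACGT\nTTGA\nCC\n'; }

# The script from the question, unchanged.
unwrap_sprintf() {
    awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}'
}

# Same script with plain string concatenation instead of sprintf.
unwrap_plain() {
    awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s $0}END{if(s)print s}'
}

sample | unwrap_sprintf
sample | unwrap_plain
```

Both commands print the header followed by the single line ACGTTTGACC.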

The last block will be executed after having read all lines of input. It will print the sequence belonging to the last header line, if there was any.

Summary:

The awk script concatenates all separate sequence lines by saving them in a variable. When a header line is found, it outputs the sequence read so far, followed by the new header on a line of its own. At the end, the sequence belonging to the last header is output.

An alternative awk script that doesn't store the sequence in a variable (this may be useful if you have very large genomes in your fasta files, since each line is printed as soon as it is read):

/^>/ {
    if (NR == 1) {
        print;  # 1st header line, just print it.
    } else {
        # Print a newline for the prev. sequence, then the header line on its own line.
        printf("\n%s\n", $0);
    }
    next; # Skip to next input line.
}

{
    printf("%s", $0); # Print sequence without newline.
}

END {
    printf("\n"); # Add final newline to output.
}

As a "one-liner":

awk '/^>/{if(NR==1){print}else{printf("\n%s\n",$0)}next} {printf("%s",$0)} END{printf("\n")}' sequence.fasta
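As a usage sketch, the streaming one-liner can be tried on a small wrapped file. The temporary directory, the file names and the sample sequences below are assumptions made for the example.

```shell
# The streaming one-liner from above, wrapped in a function for reuse.
unwrap_stream() {
    awk '/^>/{if(NR==1){print}else{printf("\n%s\n",$0)}next} {printf("%s",$0)} END{printf("\n")}' "$@"
}

# Build a small wrapped fasta file (hypothetical sample data).
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\nTTGA\nCC\n>seq2\nGGGA\nAT\n' > "$tmpdir/wrapped.fasta"

unwrap_stream "$tmpdir/wrapped.fasta" > "$tmpdir/unwrapped.fasta"
cat "$tmpdir/unwrapped.fasta"
```

The result has four lines: >seq1, ACGTTTGACC, >seq2 and GGGAAT. One difference from the original script: on empty input this version still prints a single newline, because its END block is unconditional, whereas the original prints nothing.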

Related Solutions

AWK and KSH – Wrapping Comma Delimited Values to Next Line

You could use sed like this:

    sed 'h;s/,[^|]*//g;x
    /,/{s/|[^,|]*,*/|-/g;H;}
    x;s/-\([^|]\)/\1/g;P;D'

It wound up being relatively simple after all. Applying that little script to your data gets:

key1|0|11881|0|0|0|0|11769|0|0|0
key2|2027|345|0|1|0|2040|364|0|1|0
key2|-|712|-|-|-|-|729|-|-|-
key3|0|670944|0|0|0|0|495554|0|0|0
key4|1847|1|0|0|0|1814|1|0|0|0
key4|-|21|-|-|-|-|22|-|-|-
key5|1880|11|0|154|0|1886|11|0|151|0
key5|-|402|-|-|-|-|397|-|-|-
key6|1|65|0|8|0|16684|51|0|8|0
key6|1|4570|-|-|-|0|4176|-|-|-
key6|19137|-|-|-|-|-|-|-|-|-
key7|1851|11|0|202|0|1856|13|0|193|0
key7|-|757|-|-|-|-|751|-|-|-

Basically sed just tackles each field from both ends. It first saves a copy of its current iteration to hold space. Then sed removes everything from every field following the first comma on. After that sed switches back to its saved copy so it can remove the field it just saved in the other buffer.

If commas remain it appends the second copy to the first following an inserted \newline character so it can recurse at least once more when P;D Print then Delete only up to the first occurring \newline character in pattern space before starting over with what remains.

awk – Multiline Regexp with Grep, Sed, Awk, and Perl

You can do this with Awk by setting the "Record Separator" variable to be a regex matching at least two consecutive newline characters:

awk -v RS='\n\n+' '/1.*2.*3/' file.txt

You can also set the "Field Separator" to be a single newline character:

awk -v RS='\n\n+' -F '\n' '$1 == "LINE OF TEXT 1" && $2 == "LINE OF TEXT 2" && $3 == "LINE OF TEXT 3"' file.txt

Broken up for readability:

awk -v RS='\n\n+' -F '\n' '
  $1 == "LINE OF TEXT 1" &&
  $2 == "LINE OF TEXT 2" &&
  $3 == "LINE OF TEXT 3"
' file.txt
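The record-splitting idea can be tried on a small sample. Note that this sketch swaps in paragraph mode (RS set to the empty string), which POSIX specifies for blank-line-separated records, instead of the regex RS='\n\n+', which needs an awk that accepts a regular expression for RS (gawk and mawk do). The function name and the sample text are made up.

```shell
# Hypothetical input: two blocks separated by a blank line.
paragraphs() {
    printf 'LINE OF TEXT 1\nLINE OF TEXT 2\nLINE OF TEXT 3\n\nother\nblock\n'
}

# RS="" makes each blank-line-separated block one record;
# -F "\n" makes each line of that block one field.
paragraphs | awk -v RS='' -F '\n' '
  $1 == "LINE OF TEXT 1" && $3 == "LINE OF TEXT 3" {
    print NR ": " NF " lines"
  }
'
```

Only the first block matches, so this prints "1: 3 lines".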

With your requirement of only printing the filename if a match is found, you can do this like so:

awk -v RS='\n\n+' -F '\n' '
  $1 == "LINE OF TEXT 1" &&
  $2 == "LINE OF TEXT 2" &&
  $3 == "LINE OF TEXT 3" {
    found++   # "match" is a built-in awk function, so use another name
  }
  END {
    if (found) {
      print FILENAME
    }
  }
' file.txt

But considering you are talking about using find in combination with awk, I'd recommend just using Awk for the exit status and using find for the printing:

find . -type f -exec awk -v RS='\n\n+' -F '\n' '
  $1 ~ /LINE OF TEXT 1/ &&
  $2 ~ /LINE OF TEXT 2/ &&
  $3 ~ /LINE OF TEXT 3/ {
    found = 1
    exit          # exit still runs END, which sets the exit status
  }
  END { exit !found }
' {} \; -print

That way, if you want to do something else before printing (some other find primary), you're already set up to do so.
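Here is a sketch that exercises the find-plus-exit-status idea on two throwaway files. It records a match in a flag and sets the status in END, because an exit in a main rule still runs the END block. The flag name found, the function name, the file names and the use of paragraph mode (RS='') in place of the regex RS are all assumptions of this example.

```shell
# Print the files under directory $1 that contain the three-line block.
find_matches() {
    find "$1" -type f -exec awk -v RS='' -F '\n' '
      $1 ~ /LINE OF TEXT 1/ &&
      $2 ~ /LINE OF TEXT 2/ &&
      $3 ~ /LINE OF TEXT 3/ { found = 1; exit }
      END { exit !found }   # status 0 only if a record matched
    ' {} \; -print
}

# Hypothetical test data: one file that matches, one that does not.
demo_dir=$(mktemp -d)
printf 'LINE OF TEXT 1\nLINE OF TEXT 2\nLINE OF TEXT 3\n' > "$demo_dir/hit.txt"
printf 'something\nelse\n' > "$demo_dir/miss.txt"

find_matches "$demo_dir"
```

Only the path of hit.txt is printed, since -print is reached only when awk exits with status 0.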
