This is your awk script:
/^>/ {
    print s ? s "\n" $0 : $0;
    s = "";
    next;
}
{
    s = s sprintf("%s", $0);
}
END {
    if (s)
        print s;
}
The first block gets triggered only for lines that start with >, i.e. fasta header lines.
In the first block, something gets printed. That something is s ? s "\n" $0 : $0. This means "if s is non-empty, use s and add a newline to it followed by the whole of the current line, otherwise (s is empty or unset) just use the whole current line". In this program, s will be a partially read sequence belonging to the most recently processed header line, and when the program hits a header line, this print statement will output the last sequence (which is now complete), if there was any, followed by the newly found header line on a new line.
The block then sets s to an empty string (we haven't read any sequence belonging to this header yet), and we skip to the next input line.
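To see the two branches of that conditional expression in isolation, here is a minimal sketch. The variable h and the values "ACGT" and ">seq2" are made up for the illustration; h stands in for $0.

```shell
# Hypothetical values, chosen only to exercise both branches of the ternary.
# h plays the role of $0 (the header line just read).
empty_case=$(awk 'BEGIN { s = "";     h = ">seq2"; print s ? s "\n" h : h }')
full_case=$(awk 'BEGIN { s = "ACGT"; h = ">seq2"; print s ? s "\n" h : h }')

printf '%s\n' "$empty_case"   # only the header
printf '%s\n' "$full_case"    # the stored sequence, then the header
```

With an empty s only the header is printed; with a non-empty s the stored sequence comes first, on its own line.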
The next block is executed for all lines of input (but not for header lines, as these will be skipped due to the next in the previous block). It simply appends the current line to s. sprintf is used, but it is not needed here: sprintf("%s", $0) just returns the current line as a string, so s = s $0 would work too.
The last block will be executed after having read all lines of input. It will print the sequence belonging to the last header line, if there was any.
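As a rough check of the walk-through, the script can be run on a small input. The two-record FASTA data below is invented for this example, and the sketch uses the simpler s = s $0 in place of sprintf.

```shell
# Hypothetical two-record FASTA input, made up for this illustration.
fasta='>seq1
ACGT
TTGA
>seq2
GGCC
AA'

joined=$(printf '%s\n' "$fasta" | awk '
/^>/ { print s ? s "\n" $0 : $0; s = ""; next }   # header: flush previous sequence
     { s = s $0 }                                  # sequence line: append to s
END  { if (s) print s }                            # flush the last sequence
')

printf '%s\n' "$joined"
```

Each header should come out on its own line, followed by its sequence joined onto a single line (ACGTTTGA for seq1, GGCCAA for seq2).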
Summary:
The awk script concatenates all separate sequence lines by saving them in a variable. When a header line is found, it outputs the sequence read so far together with the new header on a line of its own. At the end, the sequence belonging to the last header is output.
Alternative awk script that doesn't store the sequence in a variable (may be useful if you have very large genomes in your fasta files):
/^>/ {
    if (NR == 1) {
        print; # 1st header line, just print it.
    } else {
        # Print a newline for the prev. sequence, then the header line on its own line.
        printf("\n%s\n", $0);
    }
    next; # Skip to next input line.
}
{
    printf("%s", $0); # Print sequence without newline.
}
END {
    printf("\n"); # Add final newline to output.
}
As a "one-liner":
awk '/^>/{if(NR==1){print}else{printf("\n%s\n",$0)}next} {printf("%s",$0)} END{printf("\n")}' sequence.fasta
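A quick way to try the one-liner is to run it on a throwaway file. The file contents below are invented for the example, and the temporary file stands in for sequence.fasta.

```shell
# Hypothetical sample file, created only for this demonstration.
tmp=$(mktemp)
printf '%s\n' '>seq1' 'ACGT' 'TTGA' '>seq2' 'GGCC' 'AA' > "$tmp"

# Same one-liner as above, reading the temporary file instead of sequence.fasta.
streamed=$(awk '/^>/{if(NR==1){print}else{printf("\n%s\n",$0)}next} {printf("%s",$0)} END{printf("\n")}' "$tmp")
rm -f "$tmp"

printf '%s\n' "$streamed"
```

On this input the result should match what the variable-based script produces, but each piece of sequence is written out as soon as it is read rather than being held in memory.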
Best Answer
With sed:
-n suppresses automatic output.
/.../ is the regular expression to match >chr1, >chr2, >chr21 or >chrX.
{p;n;p} if the expression matches, print the line, read the next input line into the pattern space, and print that line too.
If it must be awk, it's nearly the same mechanism: