AWK Command – Replace Character for Lines Not Starting with ‘>’

awk, bioinformatics, text processing

I'm working with sequence data and I stupidly cannot find the correct way to replace "." by "X" in lines not starting with ">" using awk. I really need to use awk and not sed.

I got this far, but simply all "." are replaced in this way:

awk '/^>/ {next} {gsub(/\./,"X")}1' Sfr.pep > Sfr2.pep

Example subdata:

>sequence.1
GTCAGTCAGTCA.GTCAGTCA

Result I want to get:

>sequence.1
GTCAGTCAGTCAXGTCAGTCA

Best Answer

It seems more natural to do this with sed:

sed '/^>/!y/./X/' Sfr.pep >Sfr2.pep

This tests the current line against ^> ("does this line start with a > character?"). If the expression does not match, the y command changes each dot in that line to X. In y, the dot is a literal character rather than a regular-expression wildcard, so it needs no escaping.

Testing:

$ cat Sfr.pep
>sequence.1
GTCAGTCAGTCA.GTCAGTCA
$ sed '/^>/!y/./X/' Sfr.pep >Sfr2.pep
$ cat Sfr2.pep
>sequence.1
GTCAGTCAGTCAXGTCAGTCA

The main issue with your awk code is that next is executed whenever a FASTA header line is encountered. next abandons the current line and skips all remaining rules, including the trailing 1 that does the printing, so your code outputs only the sequence data, without headers. The sequence data itself should look fine, but that is not much help.
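A small reproduction on the sample data from the question may make this concrete. The file names demo.pep and demo.out are placeholders chosen for this sketch, not files from the question:

```shell
# Build the two-line sample from the question (demo.pep is a placeholder name).
printf '>sequence.1\nGTCAGTCAGTCA.GTCAGTCA\n' > demo.pep

# The command from the question: "next" skips each header line
# before the trailing 1 gets a chance to print it.
awk '/^>/ {next} {gsub(/\./,"X")}1' demo.pep > demo.out

# Shows one line only: GTCAGTCAGTCAXGTCAGTCA (the >sequence.1 header is gone).
cat demo.out
```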

Simply negating the test and dropping the next block (or putting a print before the next) would solve it in awk for you. However, and this is my personal opinion, using the y command in sed is more elegant than using gsub() (or s///g in sed) for transliterating single characters.
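For completeness, the two awk fixes described above could look roughly like this. This is a sketch on the sample data; sample.pep, out1.pep and out2.pep are placeholder names, so substitute Sfr.pep and Sfr2.pep for real use:

```shell
# Sample input from the question (sample.pep is a placeholder name).
printf '>sequence.1\nGTCAGTCAGTCA.GTCAGTCA\n' > sample.pep

# Variant 1: negate the test, so gsub() only runs on non-header lines.
# The trailing 1 is an always-true pattern that prints every line.
awk '!/^>/ {gsub(/\./,"X")} 1' sample.pep > out1.pep

# Variant 2: keep "next", but print the header line before skipping ahead.
awk '/^>/ {print; next} {gsub(/\./,"X")} 1' sample.pep > out2.pep

cat out1.pep
```

Both variants should leave the header >sequence.1 untouched and turn the sequence line into GTCAGTCAGTCAXGTCAGTCA.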