AWK Command – Replace Character for Lines Not Starting with ‘>’

awk, bioinformatics, text processing

I'm working with sequence data and I stupidly cannot find the correct way to replace "." by "X" in lines not starting with ">" using awk. I really need to use awk and not sed.

I got this far, but simply all "." are replaced in this way:

awk '/^>/ {next} {gsub(/\./,"X")}1' Sfr.pep > Sfr2.pep

Example subdata:

>sequence.1
GTCAGTCAGTCA.GTCAGTCA

Result I want to get:

>sequence.1
GTCAGTCAGTCAXGTCAGTCA

Best Answer

It seems more natural to do this with sed:

sed '/^>/!y/./X/' Sfr.pep >Sfr2.pep

This would match ^> against the current line ("does this line start with a > character?"). If that expression does not match, the y command is used to change each dot in that line to X.

Testing:

$ cat Sfr.pep
>sequence.1
GTCAGTCAGTCA.GTCAGTCA

$ sed '/^>/!y/./X/' Sfr.pep >Sfr2.pep

$ cat Sfr2.pep
>sequence.1
GTCAGTCAGTCAXGTCAGTCA

The main issue with your awk code is that next is executed whenever you come across a fasta header line. This means that your code only outputs sequence data, without the headers. The sequence data itself should look fine, but that is not much help.

Simply negating the test and dropping the next block (or preceding the next with print) would solve it in awk for you, but, and this is my personal opinion, using the y command in sed is more elegant than using gsub() (or s///g in sed) for transliterating single characters.
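For the awk-only route the question asks for, the negated test described above could look like the following minimal sketch. It recreates the two sample lines from the question so that it runs on its own; nothing else is assumed about the real Sfr.pep.

```shell
# Recreate the two-line sample from the question.
printf '%s\n' '>sequence.1' 'GTCAGTCAGTCA.GTCAGTCA' > Sfr.pep

# Replace dots only on lines that do NOT start with ">".
# The trailing 1 is an always-true pattern, so every line
# (header or sequence) is printed.
awk '!/^>/ { gsub(/\./, "X") } 1' Sfr.pep > Sfr2.pep

cat Sfr2.pep
```

Because there is no next, header lines fall through to the final 1 and are printed unchanged, dots included.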

Remove string from a particular field using awk/sed

Examples:

Modifying FS:

awk -F" +|;|=" '

$3 == "gene" {
    printf("%s\t%s\t%s\t%s\t%s\t%s\t\n",
    $1, $4, $5, $10, $6, $7);
}
' data.file

Using split:

awk '
$3 == "gene" {
    split($9, a, ";")
    printf("%s\t%s\t%s\t%s\t%s\t%s\t\n",
    $1, $4, $5, substr(a[1], 3), $6, $7);
}
' data.file

OFS and FS:

Set the Output Field Separator (OFS) to a tab and assign FS inside the awk program instead of with -F. FS is also extended here so that tabs count as separators:

awk '
BEGIN {
    FS="[ \t]+|;|="
    OFS="\t"
}
$3 == "gene" {
    print $1, $4, $5, $10, $6, $7
}

' data.file
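To see how a regular-expression FS numbers the fields, it can help to run it on a single line. The input below is a made-up, GFF-like line (the real data.file is not shown, so the file name demo.data and its contents are assumptions made purely for illustration):

```shell
# Hypothetical input line; the real data.file is not shown in the source.
printf '%s\n' 'chr1 src gene 100 200 . + . ID=g1;Name=abc' > demo.data

# List every field as awk sees it when runs of blanks, ";" and "="
# all act as separators.
awk 'BEGIN { FS = "[ \t]+|;|=" } { for (i = 1; i <= NF; i++) printf "$%d=%s\n", i, $i }' demo.data

# Same field selection as in the examples above, tab-separated via OFS.
awk 'BEGIN { FS = "[ \t]+|;|="; OFS = "\t" } $3 == "gene" { print $1, $4, $5, $10, $6, $7 }' demo.data > demo.out

cat demo.out
```

With this sample line, "ID" becomes $9 and "g1" becomes $10, which is why the examples print $10 to get the value without its key.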

Also see The Open Group's awk specification, in particular the sections "Variables and Special Variables" and "Examples".

The Gawk manual is also useful: it usually notes when a feature is a gawk extension to awk.

Sed/awk replace a specific pattern under another pattern

Sed can handle this quite easily. It's a single "substitute" command, prefixed with an address range. I've added extra spacing for better readability:

sed -e '/^\[ABC\]$/ , /^\[.*\]$/     s/^\(value1=\).*$/\1notbla/'

Without the extra spacing, it's:

sed -e '/^\[ABC\]$/,/^\[.*\]$/s/^\(value1=\).*$/\1notbla/'

You don't really need anchored regexes, but they may be safer in some cases of unusual inputs. A slightly shorter version with unanchored regexes is:

sed -e '/\[ABC\]/,/^\[/s/^\(value1=\).*$/\1notbla/'

Explanation:

You asked for each flag or option to be explained, and I've got the time, so here you go. I'm explaining the final (shortest) version out of the three Sed commands listed above.

The first part of the line is an address range: /startregex/,/stopregex/. The substitute command that follows the address range is applied only to lines from the one matching startregex through the one matching stopregex (inclusive).
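One way to check which lines an address range selects is to print only that range with sed -n and the p command. This is a small sketch on a made-up config file; the file name demo.ini and every section except [ABC] are assumptions, since the original input is not shown:

```shell
# Hypothetical input; the real config file is not shown in the source.
printf '%s\n' '[XYZ]' 'value1=foo' '[ABC]' 'value1=bla' 'value2=keep' '[DEF]' 'value1=bla' > demo.ini

# -n suppresses the default output; p prints only lines inside the range.
sed -n -e '/\[ABC\]/,/^\[/p' demo.ini > range.out

cat range.out
```

The printed range runs from the [ABC] line through the [DEF] line. The [ABC] line does not close the range itself, because sed only starts looking for the stop regex on the line after the one that opened the range.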

In this case the start regex is /\[ABC\]/. Square brackets are usually special characters within a regex, so we put a backslash before each to signify literal square bracket characters.

The stop regex is /^\[/, which uses the special regex character ^ to signify the start of a line. This pattern will match any line that starts with a literal left square bracket ([).

The substitute command is basically quite simple; the general format is s/findregex/replacetext/. It can also have special flags placed after the final / to modify its behavior, but I'm not using any such flags here.

The "find regex" is ^\(value1=\).*$.

The caret (^) matches the start of the line, as mentioned earlier, and the dollar sign ($) matches the end of the line. So this whole pattern must match an entire line, not merely part of one.

The parentheses (()), unlike square brackets, are non-special by default in the basic regex syntax that sed uses, so we put backslashes before them to give them their special meaning. They allow parts of the matched text (the text matched by the "find regex") to be used in the replacement text. Specifically, the \1 in the replacement text means, "The text matched within the first set of parentheses in the regex." In this case, that is always just "value1=".

The final element in the "find regex" is .*. The dot (.) means "any single character," and the asterisk (*) means "any number of times (zero or more)." So the dot star (.*) matches the entire rest of the line, after the equals sign.

"notbla" in the replacement text is just static text, nothing special about it.

To really learn Sed properly, I highly recommend the Grymoire Sed tutorial, which is free online.
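Putting the pieces of the explanation together, here is a minimal end-to-end run of the shortest command. The input is invented for illustration (the file name demo.ini and the [XYZ] and [DEF] sections are assumptions; only [ABC], value1 and notbla come from the commands above):

```shell
# Hypothetical input; the real config file is not shown in the source.
printf '%s\n' '[XYZ]' 'value1=foo' '[ABC]' 'value1=bla' 'value2=keep' '[DEF]' 'value1=bla' > demo.ini

# Rewrite value1 only between the [ABC] header and the next section header.
sed -e '/\[ABC\]/,/^\[/s/^\(value1=\).*$/\1notbla/' demo.ini > demo.out

cat demo.out
```

Only the value1 line inside [ABC] changes to value1=notbla. The value1 lines under [XYZ] and [DEF] are left alone, because they fall outside the address range.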
