The first one-liner / full script parse and convert the file format described in the question; the second full script parses and converts a FASTA file format.
#1
Golfed one-liner:
perl -lane 'my $s;my @m=$F[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@F,$s);print(join(",",@F))' infile
Expanded full script:
#!/usr/bin/perl
use strict;
use warnings;
@ARGV == 1 || die("Usage: <command> <input_file>\n");
open(my $in, $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");
while(<$in>) {
my $string;
my @fields = split(" ");
my @matches = $fields[1] =~ /C.?/g;
foreach(@matches) {
$_ eq "CC" ? $string .= "C" : $string .= "C#"
}
push(@fields, $string);
print(join(",", @fields) . "\n")
}
close($in);
exit
Explanation:
- The input file is processed line by line;
- Each line is splitted into two strings, the part before the space and the part after the space;
- Each substring made of a "C" character optionally followed by another character (optionally to catch also a "C" character at the end of the string) in the second string is evaluated, and if the character following the "C" is a "C", "C" is appended to the end of a custom temporary string; otherwise "C#" is appended at the end of the custom temporary string;
- The first, second and custom temporary string are printed, comma-separated, followed by a newline;
Sample output:
% cat infile
c32_g1_i1_3 GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
c32_g1_i1_6 AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl -ne 'my $s;my @f=split(" ");my @m=$f[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@f,$s);print(join(",",@f)."\n")' infile
c32_g1_i1_3,GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#
c32_g1_i1_6,AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#
#2
Expanded full version:
#!/usr/bin/perl
use strict;
use warnings;
@ARGV == 1 || die("Usage: <command> <input_file>\n");
open(my $in, $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");
open(my $tmp, "+>", "tmpfile") || die("Could not create temporary file \"tmpfile\": $!\n");
select($tmp);
while(<$in>) {
if(/^>/) {
s/$/ /
}
if(my $next = <$in>) {
if($next !~ /^>/) {
chomp
}
print;
seek($in, -length($next), 1)
}
else {
print
}
}
close($in);
seek($tmp, 0, 0);
select(STDOUT);
while(<$tmp>) {
my $string;
my @fields = split(/ (?!.* )|\n/);
my @matches = $fields[1] =~ /C.?/g;
foreach(@matches) {
$_ eq "CC" ? $string .= "C" : $string .= "C#"
}
push(@fields, $string);
print(join(",", @fields) . "\n")
}
close($tmp);
unlink("tmpfile");
exit
Explanation:
- The input file is processed line by line;
- If the current line starts with a
>
character, a space is appended to the line; if a following line exists and doesn't start with a >
character, the newline character is stripped from the current line; the current line is printed to a temporary file;
- The temporary file is processed line by line;
- Each line is splitted into two strings, the part before the last space and the part after the last space;
- Each substring made of a "C" character optionally followed by another character (optionally to catch also a "C" character at the end of the string) in the second string is evaluated, and if the character following the "C" is a "C", "C" is appended to the end of a custom temporary string; otherwise "C#" is appended at the end of the custom temporary string;
- The first, second and custom temporary string are printed, comma-separated, followed by a newline;
- The temporary file is removed;
Sample output:
% cat infile
>c32_g1_i1_
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
>c32_g1_i1_
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl script.pl infile
>c32_g1_i1_,3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#C#C#C#C#C#
>c32_g1_i1_,6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#C#C#CC#C#
Anchor it (^
is start of line) so that the A
is only matched if it's the first character:
$ letter=A; id=MYNEWIDSTRING; sed "/^$letter /s/[^ ]*/$id/2" file
A MYNEWIDSTRING EXTERNAL
B CC32480A3247F84A SYSTEM
C EC2A63F12A63B76C EXTERNAL
by the way, if you want to pass variables to sed
but you need strong quoting remember you can turn quoting on and off while adding double quotes for the variables - ugly but probably best practice in general:
sed '/^'"$letter"' /s/[^ ]*/'"$id"'/2'
Best Answer
The following script should solve your problem:
Don't forget to make it executable using the following command: