Ubuntu – Printing pattern of ā€œCā€ character

command linetext processing

I would like to print pattern of Cys residue from each line given in file.tsv. file.tsv has two coloumns as sequenceID and Sequence. from the second column sequence first character "C" should be printed as C, if the next immediate residue is not C then the code should print C#. # should occur only one time for n number of various amino acid occurrence.

So when in Column if "C" is followed by another character I would like to print # after "C". so if sequence column has value DCFRCGHCC then it should print in the third column C#C#CC.

Example input:

c32_g1_i1_ 3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
c32_g1_i1_ 6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX

The output should be three columns: sequenceID, Sequence, Cys pattern

c32_g1_i1_3,GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C
c32_g1_i1_6,AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C 

Best Answer

The first one-liner / full script parse and convert the file format described in the question; the second full script parses and converts a FASTA file format.


#1

Golfed one-liner:

perl -lane 'my $s;my @m=$F[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@F,$s);print(join(",",@F))' infile

Expanded full script:

#!/usr/bin/perl

use strict;
use warnings;

@ARGV == 1 || die("Usage: <command> <input_file>\n");

open(my $in, $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");

while(<$in>) {
    my $string;
    my @fields = split(" ");
    my @matches = $fields[1] =~ /C.?/g;
    foreach(@matches) {
        $_ eq "CC" ? $string .= "C" : $string .= "C#"
    }
    push(@fields, $string);
    print(join(",", @fields) . "\n")
}

close($in);

exit

Explanation:

  • The input file is processed line by line;
  • Each line is splitted into two strings, the part before the space and the part after the space;
  • Each substring made of a "C" character optionally followed by another character (optionally to catch also a "C" character at the end of the string) in the second string is evaluated, and if the character following the "C" is a "C", "C" is appended to the end of a custom temporary string; otherwise "C#" is appended at the end of the custom temporary string;
  • The first, second and custom temporary string are printed, comma-separated, followed by a newline;

Sample output:

% cat infile
c32_g1_i1_3 GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
c32_g1_i1_6 AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl -ne 'my $s;my @f=split(" ");my @m=$f[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@f,$s);print(join(",",@f)."\n")' infile
c32_g1_i1_3,GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#
c32_g1_i1_6,AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#

#2

Expanded full version:

#!/usr/bin/perl

use strict;
use warnings;

@ARGV == 1 || die("Usage: <command> <input_file>\n");

open(my $in, $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");
open(my $tmp, "+>", "tmpfile") || die("Could not create temporary file \"tmpfile\": $!\n");

select($tmp);

while(<$in>) {
    if(/^>/) {
        s/$/ /
    }
    if(my $next = <$in>) {
        if($next !~ /^>/) {
            chomp
        }
        print;
        seek($in, -length($next), 1)
    }
    else {
        print
    }
}

close($in);

seek($tmp, 0, 0);

select(STDOUT);

while(<$tmp>) {
    my $string;
    my @fields = split(/ (?!.* )|\n/);
    my @matches = $fields[1] =~ /C.?/g;
    foreach(@matches) {
        $_ eq "CC" ? $string .= "C" : $string .= "C#"
    }
    push(@fields, $string);
    print(join(",", @fields) . "\n")
}

close($tmp);

unlink("tmpfile");

exit

Explanation:

  • The input file is processed line by line;
  • If the current line starts with a > character, a space is appended to the line; if a following line exists and doesn't start with a > character, the newline character is stripped from the current line; the current line is printed to a temporary file;
  • The temporary file is processed line by line;
  • Each line is splitted into two strings, the part before the last space and the part after the last space;
  • Each substring made of a "C" character optionally followed by another character (optionally to catch also a "C" character at the end of the string) in the second string is evaluated, and if the character following the "C" is a "C", "C" is appended to the end of a custom temporary string; otherwise "C#" is appended at the end of the custom temporary string;
  • The first, second and custom temporary string are printed, comma-separated, followed by a newline;
  • The temporary file is removed;

Sample output:

% cat infile 
>c32_g1_i1_
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
>c32_g1_i1_
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl script.pl infile 
>c32_g1_i1_,3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#C#C#C#C#C#
>c32_g1_i1_,6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#C#C#CC#C#
Related Question