Python – case sensitive substitution; same target ids

perl python sed

I am struggling to make a case-sensitive replacement in a text file. Below is a segment of my sed script, which I run as
sed -f file.sed < input.txt > output.txt

 s/\<code_229633_13\>/R77_08349T0/
 s/\<code_229633_138\>/R77_09738T0/
 s/\<code_230519_10\>/R77_04813T0/
 s/\<code_230519_1\>/R77_13591T0/
 s/\<code_230519_13\>/R77_05463T0/
 up to line 14521....

The script works well, but I also have cases where two or more TARGET ids (code_010512_23 and code_299097_0) map to the same REPLACEMENT id (R77_14520T0). For those I would like the output to be something like R77_14520T0.a and R77_14520T0.b (lines 1 and 2 below):

s/code_010512_23/R77_14520T0/ --> R77_14520T0.a
s/code_299097_0/R77_14520T0/ --> R77_14520T0.b
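
One way to get the .a/.b suffixes is to build the target-to-replacement mapping up front and add a letter only to replacement ids that occur more than once. The following is a minimal Python sketch, not a tested solution for the full 14521-line file. It assumes the (target, replacement) pairs have already been parsed out of the sed file and that no replacement id is shared by more than 26 targets. `suffix_duplicates` is a hypothetical helper name.

```python
from collections import Counter
from string import ascii_lowercase


def suffix_duplicates(pairs):
    """Map each target id to its replacement id, appending .a, .b, ...
    when several targets share the same replacement id."""
    # How many targets point at each replacement id in total.
    totals = Counter(replacement for _, replacement in pairs)
    # How many suffixed names were already handed out per replacement id.
    used = Counter()
    mapping = {}
    for target, replacement in pairs:
        if totals[replacement] > 1:
            letter = ascii_lowercase[used[replacement]]
            mapping[target] = f"{replacement}.{letter}"
            used[replacement] += 1
        else:
            mapping[target] = replacement
    return mapping


pairs = [
    ("code_229633_13", "R77_08349T0"),
    ("code_010512_23", "R77_14520T0"),
    ("code_299097_0", "R77_14520T0"),
]
mapping = suffix_duplicates(pairs)
print(mapping["code_010512_23"])  # R77_14520T0.a
print(mapping["code_299097_0"])   # R77_14520T0.b
```

Letters are assigned in the order the pairs appear, so the first target listed for a shared replacement id gets .a.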

Furthermore, a more complex but similar case is when I have the following input file (input2.txt):

  ID=gene09464;Name=code_229633_13;isoforms=1           
  ID=mRNA10661;Parent=gene09464;Name=code_229633_13         
  ID=exon26192;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0  1   1093    +
  ID=exon26193;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0  1094    1873    +

  ID=gene09491;Name=code_229633_138;isoforms=1          
  ID=mRNA10690;Parent=gene09491;Name=code_229633_138            
  ID=exon26252;Parent=mRNA10690;Name=code_229633_138;Target=R77_09738T0 1   411 +

  ID=gene09513;Name=code_230519_10;isoforms=1           
  ID=mRNA10715;Parent=gene09513;Name=code_230519_10         
  ID=exon26311;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0  1   59  +
  ID=exon26312;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0  60  186 +

  ID=gene09511;Name=code_230519_1;isoforms=1            
  ID=mRNA10713;Parent=gene09511;Name=code_230519_1          
  ID=exon26308;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0   1   1075    +
  ID=exon26309;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0   1076    1128    +

  ID=gene09514;Name=code_230519_13;isoforms=1           
  ID=mRNA10716;Parent=gene09514;Name=code_230519_13         
  ID=exon26316;Parent=mRNA10716;Name=code_230519_13;Target=R77_05463T0  1   219 +

  ID=gene00865;Name=code_010512_23;isoforms=1           
  ID=mRNA00979;Parent=gene00865;Name=code_010512_23         
  ID=exon02477;Parent=mRNA00979;Name=code_010512_23;Target=R77_14520T0  1   143 +

  ID=gene14561;Name=code_299097_0;isoforms=2            
  ID=mRNA16419;Parent=gene14561;Name=code_299097_0          
  ID=exon39828;Parent=mRNA16419;Name=code_299097_0;Target=R77_14520T0   144 193 +
  ID=mRNA16420;Parent=gene14561;Name=code_299097_0          
  ID=exon39828;Parent=mRNA16420;Name=code_299097_0;Target=R77_15554T0   408 457 +

I need to apply the replacements in the same way as before, but only on the lines that contain the word "isoforms", in other words on lines 1, 6, 10, 15, 20, 24 and 28 and nowhere else in the text. The format of this input file is exactly as shown, with a blank line before each block that starts with an "isoforms" line.
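
To illustrate the restriction, here is a rough Python sketch that substitutes ids only on lines containing "isoforms". It assumes a non-empty dict `mapping` from target ids to replacement ids. `rename_isoform_lines` is a hypothetical name. Python's `\b` plays the role of sed's `\<` and `\>` here, so code_230519_1 does not match inside code_230519_10.

```python
import re


def rename_isoform_lines(lines, mapping):
    """Replace ids from `mapping` only on lines containing 'isoforms'."""
    # Longest ids first keeps the alternation unambiguous; the \b anchors
    # already prevent a short id from matching inside a longer one.
    ids = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, ids)) + r")\b")
    out = []
    for line in lines:
        if "isoforms" in line:
            line = pattern.sub(lambda m: mapping[m.group(1)], line)
        out.append(line)
    return out


lines = [
    "ID=gene09511;Name=code_230519_1;isoforms=1",
    "ID=mRNA10713;Parent=gene09511;Name=code_230519_1",
    "ID=gene09513;Name=code_230519_10;isoforms=1",
]
mapping = {
    "code_230519_1": "R77_13591T0",
    "code_230519_10": "R77_04813T0",
}
result = rename_isoform_lines(lines, mapping)
print("\n".join(result))
```

Only the first and third lines change; the mRNA line keeps its original Name because it does not contain "isoforms".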

My desired output

 ID=gene09464;Name=R77_08349T0;isoforms=1           
 ID=mRNA10661;Parent=gene09464;Name=code_229633_13          
 ID=exon26192;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0   1   1093    +
 ID=exon26193;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0   1094    1873    +
 ID=exon26194;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0   1874    4065    +

 ID=gene09491;Name=R77_09738T0;isoforms=1           
 ID=mRNA10690;Parent=gene09491;Name=code_229633_138         
 ID=exon26252;Parent=mRNA10690;Name=code_229633_138;Target=R77_09738T0  1   411 +

 ID=gene09513;Name=R77_04813T0;isoforms=1
 ID=mRNA10715;Parent=gene09513;Name=code_230519_10          
 ID=exon26311;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   1   59  +
 ID=exon26312;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   60  186 +
 ID=exon26313;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   187 678 +
 ID=exon26314;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   679 1399    +
 ID=exon26315;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   1400    1402    +

 ID=gene09511;Name=R77_13591T0;isoforms=1           
 ID=mRNA10713;Parent=gene09511;Name=code_230519_1           
 ID=exon26308;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0    1   1075    +
 ID=exon26309;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0    1076    1128    +

 ID=gene09514;Name=R77_05463T0;isoforms=1           
 ID=mRNA10716;Parent=gene09514;Name=code_230519_13          
 ID=exon26316;Parent=mRNA10716;Name=code_230519_13;Target=R77_05463T0   1   219 +

 ID=gene00865;Name=R77_14520T0.a;isoforms=1         
 ID=mRNA00979;Parent=gene00865;Name=code_010512_23          
 ID=exon02477;Parent=mRNA00979;Name=code_010512_23;Target=R77_14520T0   1   143 +

 ID=gene14561;Name=R77_14520T0.b;isoforms=2         
 ID=mRNA16419;Parent=gene14561;Name=code_299097_0           
 ID=exon39828;Parent=mRNA16419;Name=code_299097_0;Target=R77_14520T0    144 193 +
 ID=mRNA16420;Parent=gene14561;Name=code_299097_0           
 ID=exon39828;Parent=mRNA16420;Name=code_299097_0;Target=R77_15554T0    408 457 +

Best Answer

This is awkward to do with sed: it is a stream editor that processes its input in a single pass and has no convenient way to count how often each target occurs. Try this Perl script instead:

#!/usr/bin/env perl
use strict;
use warnings;

## Set the record separator to \n\n so that each
## blank-line-separated block is read as a single record
$/ = "\n\n";
## This array will contain all records of the file
my @records = <>;

## The list of suffixes
my @suffix = ('a' .. 'z');

## %seen counts how often each target occurs in the whole file;
## %newseen counts how often it has occurred so far while printing
my (%seen, %newseen);

## First pass: count the targets
foreach (@records) {
    ## If the current record contains "isoforms",
    ## take the first Target that follows it
    if (/isoforms.*?Target=(\S+)/s) {
        $seen{$1}++;
    }
}

## Second pass: now that we have processed the entire file,
## go back and print each record
foreach (@records) {
    ## If this record is one of the ones we're interested in
    if (/Name=(.+?);.*?isoforms=.*?Target=(\S+)/s) {
        my ($name, $target) = ($1, $2);
        ## Track how many times we've seen this target so far
        $newseen{$target}++;
        ## If this target exists more than once in the input file
        if ($seen{$target} > 1) {
            ## Use the %newseen hash to choose the right letter.
            ## The -1 is needed because the first element of an
            ## array is 0, not 1.
            my $letter = $suffix[ $newseen{$target} - 1 ];
            ## \Q...\E makes the name match literally
            s/\Q$name\E/$target.$letter/;
        }
        else {
            s/\Q$name\E/$target/;
        }
    }
    print;
}

Save the script above as foo.pl, make it executable (chmod a+x foo.pl) and run it on your input file:

./foo.pl input.txt > output.txt
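
Since the question is also tagged Python, the same two-pass idea can be sketched in Python. This is an illustrative port, not part of the original answer. It assumes records are separated by exactly one empty line and that no target is shared by more than 26 records. `rename_records` is a hypothetical name.

```python
import re
from collections import Counter
from string import ascii_lowercase

# re.S lets "." cross newlines, like the /s flag in the Perl regexes.
FIRST_TARGET = re.compile(r"isoforms.*?Target=(\S+)", re.S)
NAME_AND_TARGET = re.compile(
    r"Name=(.+?);.*?isoforms=.*?Target=(\S+)", re.S
)


def rename_records(text):
    """Rename the gene line of each record after its first Target."""
    records = text.split("\n\n")

    # First pass: count how often each target occurs in the whole input.
    seen = Counter()
    for record in records:
        m = FIRST_TARGET.search(record)
        if m:
            seen[m.group(1)] += 1

    # Second pass: rewrite the captured Name, suffixing shared targets.
    so_far = Counter()
    out = []
    for record in records:
        m = NAME_AND_TARGET.search(record)
        if m:
            target = m.group(2)
            new_name = target
            if seen[target] > 1:
                new_name = f"{target}.{ascii_lowercase[so_far[target]]}"
                so_far[target] += 1
            # Replace exactly the span captured as the gene line's Name.
            record = record[:m.start(1)] + new_name + record[m.end(1):]
        out.append(record)
    return "\n\n".join(out)


sample = (
    "ID=gene00865;Name=code_010512_23;isoforms=1\n"
    "ID=mRNA00979;Parent=gene00865;Name=code_010512_23\n"
    "ID=exon02477;Parent=mRNA00979;Name=code_010512_23;"
    "Target=R77_14520T0\t1\t143\t+\n"
    "\n"
    "ID=gene14561;Name=code_299097_0;isoforms=2\n"
    "ID=mRNA16419;Parent=gene14561;Name=code_299097_0\n"
    "ID=exon39828;Parent=mRNA16419;Name=code_299097_0;"
    "Target=R77_14520T0\t144\t193\t+\n"
)
result = rename_records(sample)
print(result)
```

Like the Perl script, this derives the new name from the record's first Target field rather than from the sed file, so it only works when every record carries a Target line.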