Ubuntu – How to extract multiple bits of information that appear on different lines within the same text file

command lineextracttext processing

I am trying to extract the sequence ID and cluster number that occur on different lines within the same text file.

The input looks like

>Cluster 72
0   319aa, >O311_01007... *
>Cluster 73
0   318aa, >1494_00753... *
1   318aa, >1621_00002... at 99.69%
2   318aa, >1622_00575... at 99.37%
3   318aa, >1633_00422... at 99.37%
4   318aa, >O136_00307... at 99.69%
>Cluster 74
0   318aa, >O139_01028... *
1   318aa, >O142_00961... at 99.69%
>Cluster 75
0   318aa, >O300_00856... *

The desired output is the sequence ID in one column and the corresponding cluster number in the second.

>O311_01007  72
>1494_00753  73
>1621_00002  73
>1622_00575  73
>1633_00422  73
>O136_00307  73
>O139_01028  74
>O142_00961  74
>O300_00856  75

Can anyone help with this?

Best Answer

With awk:

awk -F '[. ]*' 'NF == 2 {id = $2; next} {print $3, id}' input-file

we split fields on spaces or periods with -F '[. ]*'
with lines of two fields, (the >Cluster lines), save the second field as the ID and move to the next line
with other lines, print the third field and the saved ID

#1

Golfed one-liner:

perl -lane 'my $s;my @m=$F[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@F,$s);print(join(",",@F))' infile

Expanded full script:

#!/usr/bin/perl

use strict;
use warnings;

@ARGV == 1 || die("Usage: <command> <input_file>\n");

open(my $in, $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");

while(<$in>) {
    my $string;
    my @fields = split(" ");
    my @matches = $fields[1] =~ /C.?/g;
    foreach(@matches) {
        $_ eq "CC" ? $string .= "C" : $string .= "C#"
    }
    push(@fields, $string);
    print(join(",", @fields) . "\n")
}

close($in);

exit

Explanation:

The input file is processed line by line;
Each line is splitted into two strings, the part before the space and the part after the space;
Each substring made of a "C" character optionally followed by another character (optionally to catch also a "C" character at the end of the string) in the second string is evaluated, and if the character following the "C" is a "C", "C" is appended to the end of a custom temporary string; otherwise "C#" is appended at the end of the custom temporary string;
The first, second and custom temporary string are printed, comma-separated, followed by a newline;

Sample output:

% cat infile
c32_g1_i1_3 GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
c32_g1_i1_6 AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl -ne 'my $s;my @f=split(" ");my @m=$f[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@f,$s);print(join(",",@f)."\n")' infile
c32_g1_i1_3,GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#
c32_g1_i1_6,AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#

#2

Expanded full version:

#!/usr/bin/perl

use strict;
use warnings;

@ARGV == 1 || die("Usage: <command> <input_file>\n");

open(my $in, $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");
open(my $tmp, "+>", "tmpfile") || die("Could not create temporary file \"tmpfile\": $!\n");

select($tmp);

while(<$in>) {
    if(/^>/) {
        s/$/ /
    }
    if(my $next = <$in>) {
        if($next !~ /^>/) {
            chomp
        }
        print;
        seek($in, -length($next), 1)
    }
    else {
        print
    }
}

close($in);

seek($tmp, 0, 0);

select(STDOUT);

while(<$tmp>) {
    my $string;
    my @fields = split(/ (?!.* )|\n/);
    my @matches = $fields[1] =~ /C.?/g;
    foreach(@matches) {
        $_ eq "CC" ? $string .= "C" : $string .= "C#"
    }
    push(@fields, $string);
    print(join(",", @fields) . "\n")
}

close($tmp);

unlink("tmpfile");

exit

Explanation:

The input file is processed line by line;
If the current line starts with a > character, a space is appended to the line; if a following line exists and doesn't start with a > character, the newline character is stripped from the current line; the current line is printed to a temporary file;
The temporary file is processed line by line;
Each line is splitted into two strings, the part before the last space and the part after the last space;
Each substring made of a "C" character optionally followed by another character (optionally to catch also a "C" character at the end of the string) in the second string is evaluated, and if the character following the "C" is a "C", "C" is appended to the end of a custom temporary string; otherwise "C#" is appended at the end of the custom temporary string;
The first, second and custom temporary string are printed, comma-separated, followed by a newline;
The temporary file is removed;

Sample output:

% cat infile 
>c32_g1_i1_
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
>c32_g1_i1_
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl script.pl infile 
>c32_g1_i1_,3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#C#C#C#C#C#
>c32_g1_i1_,6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#C#C#CC#C#

Best Answer

Related Solutions

Ubuntu – Access individual lines of a text file and create separate text file with that name

Ubuntu – Printing pattern of “C” character

#1

#2

Related Question