The first one-liner and its expanded full script parse and convert the file format described in the question; the second full script parses and converts a FASTA file.
#1
Golfed one-liner:
perl -lane 'my $s;my @m=$F[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@F,$s);print(join(",",@F))' infile
Expanded full script:
#!/usr/bin/perl
use strict;
use warnings;
@ARGV == 1 || die("Usage: <command> <input_file>\n");
open(my $in, $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");
while(<$in>) {
    my $string;
    my @fields = split(" ");
    my @matches = $fields[1] =~ /C.?/g;
    foreach(@matches) {
        $_ eq "CC" ? $string .= "C" : $string .= "C#"
    }
    push(@fields, $string);
    print(join(",", @fields) . "\n")
}
close($in);
exit
Explanation:
- The input file is processed line by line;
- Each line is split on whitespace into two strings, the part before the space and the part after the space;
- The second string is scanned for every substring made of a "C" character optionally followed by one more character (optional so that a "C" at the very end of the string is caught too);
- For each such substring, "C#" is appended to a custom temporary string; for the substring "CC" a "C" is appended first, because ?: binds more tightly than .= in the expression used, so a "CC" contributes "CC#" (see the second line of the sample output);
- The first string, the second string and the custom temporary string are printed, comma-separated, followed by a newline.
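To look at the match-and-append step in isolation, here is a minimal sketch. The helper names markers_as_written and markers_parenthesized are hypothetical, not part of the scripts above. Perl's ?: binds more tightly than .=, so the expression from the script is parsed as (COND ? ($string .= "C") : $string) .= "C#", and a "CC" match ends up contributing "CC#". The second helper parenthesizes the conditional so that "CC" contributes only "C"; which variant you want depends on the output you need.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: same expression as in the script above.
sub markers_as_written {
    my ($sequence) = @_;
    my $string = "";
    # /C.?/g returns every "C" together with the character after it, if any
    foreach my $match ($sequence =~ /C.?/g) {
        # Parsed as ($match eq "CC" ? ($string .= "C") : $string) .= "C#"
        $match eq "CC" ? $string .= "C" : $string .= "C#";
    }
    return $string;
}

# Hypothetical variant: the conditional only chooses what gets appended.
sub markers_parenthesized {
    my ($sequence) = @_;
    my $string = "";
    foreach my $match ($sequence =~ /C.?/g) {
        $string .= ($match eq "CC") ? "C" : "C#";
    }
    return $string;
}

# The matches in this fragment are "CT", "CC" and "CL"
my $fragment = "MCTLYTCCTLTVTHTYTHKTLIFACL";
print(markers_as_written($fragment), "\n");     # prints C#CC#C#
print(markers_parenthesized($fragment), "\n");  # prints C#CC#
```

The first variant is the one that reproduces the sample output in this answer.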
Sample output:
% cat infile
c32_g1_i1_3 GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
c32_g1_i1_6 AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl -ne 'my $s;my @f=split(" ");my @m=$f[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@f,$s);print(join(",",@f)."\n")' infile
c32_g1_i1_3,GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#
c32_g1_i1_6,AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#
#2
Expanded full version:
#!/usr/bin/perl
use strict;
use warnings;
@ARGV == 1 || die("Usage: <command> <input_file>\n");
open(my $in, $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");
open(my $tmp, "+>", "tmpfile") || die("Could not create temporary file \"tmpfile\": $!\n");
select($tmp);
while(<$in>) {
    if(/^>/) {
        s/$/ /
    }
    if(my $next = <$in>) {
        if($next !~ /^>/) {
            chomp
        }
        print;
        seek($in, -length($next), 1)
    }
    else {
        print
    }
}
close($in);
seek($tmp, 0, 0);
select(STDOUT);
while(<$tmp>) {
    my $string;
    my @fields = split(/ (?!.* )|\n/);
    my @matches = $fields[1] =~ /C.?/g;
    foreach(@matches) {
        $_ eq "CC" ? $string .= "C" : $string .= "C#"
    }
    push(@fields, $string);
    print(join(",", @fields) . "\n")
}
close($tmp);
unlink("tmpfile");
exit
Explanation:
- The input file is processed line by line;
- If the current line starts with a > character, a space is appended to the line; if a following line exists and doesn't start with a > character, the newline character is stripped from the current line; the current line is printed to a temporary file;
- The temporary file is processed line by line;
- Each line is split into two strings, the part before the last space and the part after the last space;
- The second string is scanned for every substring made of a "C" character optionally followed by one more character (optional so that a "C" at the very end of the string is caught too);
- For each such substring, "C#" is appended to a custom temporary string; for the substring "CC" a "C" is appended first, because ?: binds more tightly than .= in the expression used, so a "CC" contributes "CC#";
- The first string, the second string and the custom temporary string are printed, comma-separated, followed by a newline;
- The temporary file is removed.
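As an alternative to the temporary file and the seek() calls, the header/sequence joining can also be done in memory. This is only a sketch of a different approach, not what the script above does: the helper name read_fasta_records and the tiny sample data are made up for illustration, and it assumes the whole file fits comfortably in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join each FASTA header with its wrapped sequence
# lines and return a list of [header, sequence] pairs in input order.
sub read_fasta_records {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp($line);
        next if $line eq "";
        if ($line =~ /^>/) {
            push(@records, [$line, ""]);   # a header starts a new record
        }
        elsif (@records) {
            $records[-1][1] .= $line;      # sequence lines are concatenated
        }
    }
    return @records;
}

# Made-up sample data, read through an in-memory file handle
my $fasta = ">id1\nACCA\nTTCG\n>id2\nGGC\n";
open(my $fh, "<", \$fasta) || die("Could not open in-memory file: $!\n");
foreach my $record (read_fasta_records($fh)) {
    print(join(",", @$record), "\n");
}
close($fh);
```

Each pair can then be fed to the same C-matching loop used in the scripts above.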
Sample output:
% cat infile
>c32_g1_i1_
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
>c32_g1_i1_
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl script.pl infile
>c32_g1_i1_,3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#C#C#C#C#C#
>c32_g1_i1_,6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#C#C#CC#C#
Using Perl + ssconvert (in the gnumeric package):
perl -F'\012' -00ane 'BEGIN {$, = ","; $\ = "\n"; print("ID,Name,Age,Education")} my @f; foreach(@F) {s/.*?: +//; push(@f, $_)} print(@f)' test1.txt test2.txt | ssconvert fd://0 output.xls
The Perl command reads test1.txt and test2.txt using blank lines as record separators and newline characters as field separators. It prints the header (ID,Name,Age,Education); then, for each field of each record, it strips everything up to and including the first : character that is followed by spaces, together with those spaces, and prints the record using commas as field separators and a newline character as record separator (i.e., it converts test1.txt and test2.txt to a CSV):
% cat test1.txt
ID : 1
Name: xxxx
Age: 33
Education: Mtech

ID: 2
Name: yyyy
Age: 22
Education: bsc
% cat test2.txt
ID : 3
Name: xxxx
Age: 33
Education: Mtech

ID: 4
Name: yyyy
Age: 22
Education: bsc
% perl -F'\012' -00ane 'BEGIN {$, = ","; $\ = "\n"; print("ID,Name,Age,Education")} my @f; foreach(@F) {s/.*?: +//; push(@f, $_)} print(@f)' test1.txt test2.txt
ID,Name,Age,Education
1,xxxx,33,Mtech
2,yyyy,22,bsc
3,xxxx,33,Mtech
4,yyyy,22,bsc
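For readers who prefer a script over the switches, here is a rough equivalent as a sketch. It assumes that -00 corresponds to setting $/ to the empty string (paragraph mode) and that -F'\012' with -a corresponds to splitting each record on newlines; the helper name record_to_csv is hypothetical, and the data is read from an in-memory string rather than from test1.txt and test2.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn one blank-line-separated record into a CSV row.
sub record_to_csv {
    my ($record) = @_;
    my @values;
    # split drops the empty trailing fields left by the final newlines
    foreach my $field (split(/\n/, $record)) {
        $field =~ s/.*?: +//;   # same substitution as in the one-liner
        push(@values, $field);
    }
    return join(",", @values);
}

my $data = "ID : 1\nName: xxxx\nAge: 33\nEducation: Mtech\n\n"
         . "ID: 2\nName: yyyy\nAge: 22\nEducation: bsc\n";
open(my $in, "<", \$data) || die("Could not open in-memory file: $!\n");
{
    local $/ = "";   # paragraph mode: records end at one or more blank lines
    print("ID,Name,Age,Education\n");
    while (my $record = <$in>) {
        print(record_to_csv($record), "\n");
    }
}
close($in);
```

Note that, like the one-liner, this does no CSV quoting, so it assumes the values themselves contain no commas.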
The ssconvert command reads from STDIN and converts the file to an Excel spreadsheet.
If installing gnumeric to obtain ssconvert is not an option, you could use just the Perl command and import the CSV into Excel / whatever:
perl -F'\012' -00ane 'BEGIN {$, = ","; $\ = "\n"; print("ID,Name,Age,Education")} my @f; foreach(@F) {s/.*?: +//; push(@f, $_)} print(@f)' test1.txt test2.txt >output.csv
Best Answer
This can be achieved easily with awk. We first read fileA and store its entire first column in an array named inFileA; we then read fileB and, for every row whose first column matches one of the values saved from fileA, print the entire row of fileB.
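The lookup described above can be sketched in Perl as well. This is not the awk command from the answer, only the same idea under some assumptions: columns are whitespace-separated, the helper name rows_matching_first_column is hypothetical, and the sample rows are made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the rows of fileB whose first column also
# occurs as a first column in fileA. Both arguments are array references.
sub rows_matching_first_column {
    my ($lines_a, $lines_b) = @_;
    my %in_file_a;   # plays the role of the inFileA array
    foreach my $line (@$lines_a) {
        my ($key) = split(" ", $line);
        $in_file_a{$key} = 1 if defined($key);
    }
    return grep {
        my ($key) = split(" ", $_);
        defined($key) && exists($in_file_a{$key});
    } @$lines_b;
}

# Made-up sample rows standing in for the contents of fileA and fileB
my @file_a = ("id1 foo", "id3 bar");
my @file_b = ("id1 10 20", "id2 30 40", "id3 50 60");
print("$_\n") foreach rows_matching_first_column(\@file_a, \@file_b);
```

Using a hash for the saved column keeps each lookup cheap, so fileB only needs to be scanned once.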